Mounting Google Drive¶
# Access to the file that is available on Google Drive
print("Mounting Google Drive...")
from google.colab import drive
drive.mount('/content/drive')
!mkdir -p /content/drive/MyDrive/data_processing/  # -p: no error if the directory already exists
download_folder = "/content/drive/MyDrive/CSULB THESIS 2025/Thesis Dataset"
Mounting Google Drive... Mounted at /content/drive mkdir: cannot create directory ‘/content/drive/MyDrive/data_processing/’: File exists
Import the libraries¶
# for basic operations
import numpy as np
import pandas as pd
from scipy import stats
# for visualizations
import matplotlib.pyplot as plt
import seaborn as sns
from pylab import rcParams
# plt.style.use('fivethirtyeight')
from mpl_toolkits.mplot3d import Axes3D # Import for 3D plotting
import plotly.express as px
# for modeling
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import (
    GridSearchCV, RandomizedSearchCV, train_test_split,
    StratifiedKFold, cross_val_score, cross_val_predict, cross_validate
)
# from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.svm import OneClassSVM
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, DBSCAN
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import RFECV
from sklearn.metrics import (
    confusion_matrix, f1_score, average_precision_score,
    precision_recall_curve, ConfusionMatrixDisplay,
    make_scorer, matthews_corrcoef
)
# Autoencoder
import torch
import torch.nn as nn
import torch.optim as optim
from imblearn.over_sampling import SMOTE
# to avoid warnings
import warnings
warnings.filterwarnings('ignore')
warnings.warn("this will not show")
import random
# Define your desired random seed
MANUAL_SEED = 42
def set_all_seeds(seed):
    """Sets seed for reproducibility across all relevant libraries."""
    # 1. NumPy seed
    np.random.seed(seed)
    # 2. Python standard random module seed
    random.seed(seed)
    # 3. PyTorch seeds
    torch.manual_seed(seed)
    # 4. PyTorch CUDA/GPU seeds (if available)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # for multi-GPU
    # Optional: disable non-deterministic operations for better reproducibility
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Call the function to set the seeds
set_all_seeds(MANUAL_SEED)
print(f"Random seed set to {MANUAL_SEED} for all operations.")
Random seed set to 42 for all operations.
Inventory & understand the data¶
About the Dataset

Context: manufacturing-process feature selection and categorization

Abstract: data from a semiconductor manufacturing process

- Data Set Characteristics: Multivariate
- Number of Instances: 1567
- Area: Computer
- Attribute Characteristics: Real
- Number of Attributes: 591
- Date Donated: 2008-11-19
- Associated Tasks: Classification, Causal-Discovery
- Missing Values: Yes

A complex modern semiconductor manufacturing process is normally under consistent surveillance via the monitoring of signals/variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a specific monitoring system. The measured signals contain a combination of useful information, irrelevant information, and noise, and the useful information is often buried in the latter two. Engineers typically have far more signals than are actually required. If each type of signal is treated as a feature, feature selection can be applied to identify the most relevant signals. Process engineers may then use these signals to determine key factors contributing to yield excursions downstream in the process, enabling increased process throughput, decreased time to learning, and reduced per-unit production costs.

To enhance current business improvement techniques, the application of feature selection as an intelligent-systems technique is being investigated.

The dataset presented here represents a selection of such features, where each example is a single production entity with its associated measured features, and the label is a simple pass/fail yield from in-house line testing (figure 2) together with a date-time stamp: -1 corresponds to a pass, 1 corresponds to a fail, and the date-time stamp is for that specific test point.

Using feature selection techniques, the goal is to rank features according to their impact on the overall yield of the product; causal relationships may also be considered with a view to identifying the key features.

Results may be submitted in terms of feature relevance for predictability, using error rates as the evaluation metric. It is suggested that cross-validation be applied to generate these results. Some baseline results are shown below for basic feature selection techniques using a simple kernel ridge classifier and 10-fold cross-validation.

Baseline Results: preprocessing objects were applied to the dataset simply to standardize the data and remove the constant features, then several feature selection objects, each selecting the 40 highest-ranked features, were applied with a simple classifier to obtain initial results. 10-fold cross-validation was used, and the balanced error rate (BER) was generated as the initial performance metric for investigating this dataset.

SECOM Dataset: 1567 examples, 591 features, 104 fails
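Since BER is the baseline metric described above, a minimal sketch of how it is computed may help; the helper name and toy labels here are illustrative, not part of the original baseline.

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """BER = mean of the per-class error rates (here: pass = -1, fail = 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    errors = []
    for cls in np.unique(y_true):
        mask = y_true == cls
        errors.append(np.mean(y_pred[mask] != cls))  # error rate within this class
    return float(np.mean(errors))

# A classifier that always predicts "pass" (-1) has 0% error on passes but
# 100% error on fails, so BER = 0.5 despite a high raw accuracy.
y_true = np.array([-1] * 9 + [1])
y_pred = np.full(10, -1)
print(balanced_error_rate(y_true, y_pred))
```

Because each class contributes equally, the 14:1 pass/fail imbalance cannot hide a weak fail detector behind a high accuracy number.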
Import dataset
secom = pd.read_csv("/content/drive/MyDrive/CSULB THESIS 2025/Thesis Dataset/uci-secom.csv")
secom.head()
| Time | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | Pass/Fail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-07-19 11:55:00 | 3030.93 | 2564.00 | 2187.7333 | 1411.1265 | 1.3602 | 100.0 | 97.6133 | 0.1242 | 1.5005 | ... | NaN | 0.5005 | 0.0118 | 0.0035 | 2.3630 | NaN | NaN | NaN | NaN | -1 |
| 1 | 2008-07-19 12:32:00 | 3095.78 | 2465.14 | 2230.4222 | 1463.6606 | 0.8294 | 100.0 | 102.3433 | 0.1247 | 1.4966 | ... | 208.2045 | 0.5019 | 0.0223 | 0.0055 | 4.4447 | 0.0096 | 0.0201 | 0.0060 | 208.2045 | -1 |
| 2 | 2008-07-19 13:17:00 | 2932.61 | 2559.94 | 2186.4111 | 1698.0172 | 1.5102 | 100.0 | 95.4878 | 0.1241 | 1.4436 | ... | 82.8602 | 0.4958 | 0.0157 | 0.0039 | 3.1745 | 0.0584 | 0.0484 | 0.0148 | 82.8602 | 1 |
| 3 | 2008-07-19 14:43:00 | 2988.72 | 2479.90 | 2199.0333 | 909.7926 | 1.3204 | 100.0 | 104.2367 | 0.1217 | 1.4882 | ... | 73.8432 | 0.4990 | 0.0103 | 0.0025 | 2.0544 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | -1 |
| 4 | 2008-07-19 15:22:00 | 3032.24 | 2502.87 | 2233.3667 | 1326.5200 | 1.5334 | 100.0 | 100.3967 | 0.1235 | 1.5031 | ... | NaN | 0.4800 | 0.4766 | 0.1045 | 99.3032 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | -1 |
5 rows × 592 columns
secom.describe()
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | Pass/Fail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1561.000000 | 1560.000000 | 1553.000000 | 1553.000000 | 1553.000000 | 1553.0 | 1553.000000 | 1558.000000 | 1565.000000 | 1565.000000 | ... | 618.000000 | 1566.000000 | 1566.000000 | 1566.000000 | 1566.000000 | 1566.000000 | 1566.000000 | 1566.000000 | 1566.000000 | 1567.000000 |
| mean | 3014.452896 | 2495.850231 | 2200.547318 | 1396.376627 | 4.197013 | 100.0 | 101.112908 | 0.121822 | 1.462862 | -0.000841 | ... | 97.934373 | 0.500096 | 0.015318 | 0.003847 | 3.067826 | 0.021458 | 0.016475 | 0.005283 | 99.670066 | -0.867262 |
| std | 73.621787 | 80.407705 | 29.513152 | 441.691640 | 56.355540 | 0.0 | 6.237214 | 0.008961 | 0.073897 | 0.015116 | ... | 87.520966 | 0.003404 | 0.017180 | 0.003720 | 3.578033 | 0.012358 | 0.008808 | 0.002867 | 93.891919 | 0.498010 |
| min | 2743.240000 | 2158.750000 | 2060.660000 | 0.000000 | 0.681500 | 100.0 | 82.131100 | 0.000000 | 1.191000 | -0.053400 | ... | 0.000000 | 0.477800 | 0.006000 | 0.001700 | 1.197500 | -0.016900 | 0.003200 | 0.001000 | 0.000000 | -1.000000 |
| 25% | 2966.260000 | 2452.247500 | 2181.044400 | 1081.875800 | 1.017700 | 100.0 | 97.920000 | 0.121100 | 1.411200 | -0.010800 | ... | 46.184900 | 0.497900 | 0.011600 | 0.003100 | 2.306500 | 0.013425 | 0.010600 | 0.003300 | 44.368600 | -1.000000 |
| 50% | 3011.490000 | 2499.405000 | 2201.066700 | 1285.214400 | 1.316800 | 100.0 | 101.512200 | 0.122400 | 1.461600 | -0.001300 | ... | 72.288900 | 0.500200 | 0.013800 | 0.003600 | 2.757650 | 0.020500 | 0.014800 | 0.004600 | 71.900500 | -1.000000 |
| 75% | 3056.650000 | 2538.822500 | 2218.055500 | 1591.223500 | 1.525700 | 100.0 | 104.586700 | 0.123800 | 1.516900 | 0.008400 | ... | 116.539150 | 0.502375 | 0.016500 | 0.004100 | 3.295175 | 0.027600 | 0.020300 | 0.006400 | 114.749700 | -1.000000 |
| max | 3356.350000 | 2846.440000 | 2315.266700 | 3715.041700 | 1114.536600 | 100.0 | 129.252200 | 0.128600 | 1.656400 | 0.074900 | ... | 737.304800 | 0.509800 | 0.476600 | 0.104500 | 99.303200 | 0.102800 | 0.079900 | 0.028600 | 737.304800 | 1.000000 |
8 rows × 591 columns
# 1. Count the values and map the labels for the plot index
counts = secom['Pass/Fail'].value_counts().rename({
-1: 'Pass (Good)',
1: 'Fail (Defective)'
})
# 2. Create the plot
plt.figure(figsize=(6, 4))
counts.plot(kind='bar', color=['green', 'red'], rot=0)
# 3. Add titles and show
plt.title('Product Status Distribution')
plt.ylabel('Count')
plt.show()
print("\n--- Numerical Class Counts ---")
print(counts)
--- Numerical Class Counts --- Pass/Fail Pass (Good) 1463 Fail (Defective) 104 Name: count, dtype: int64
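Those counts imply roughly a 14:1 class imbalance, which is why plain accuracy is misleading on this dataset. A quick back-of-envelope check (counts copied from the output above):

```python
# Counts taken from the value_counts() output above.
pass_count, fail_count = 1463, 104
imbalance_ratio = pass_count / fail_count
naive_accuracy = pass_count / (pass_count + fail_count)  # always-predict-pass baseline
print(f"imbalance ~ {imbalance_ratio:.1f}:1; a constant 'pass' predictor scores "
      f"{naive_accuracy:.1%} accuracy while catching zero fails")
```

This motivates the resampling (SMOTE) and imbalance-aware metrics (F1, AUPRC, BER) used later in the notebook.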
1. Ingest & clean — types, missing values, outliers, dedupe, normalize categories/units, imbalanced data¶
Check missing values¶
Address missing values:
- Feature removal: variables with more than 30% missing values are removed from the dataset.
- Row removal: rows with missing values are dropped for variables whose missingness is below 5% of total observations.
- Imputation: remaining missing entries are imputed with the mean or the median, depending on the variable's distribution (normal or skewed, respectively).
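The first two rules can be expressed as a single helper; this is a minimal sketch on a toy DataFrame (the function name and thresholds-as-arguments are my own framing; imputation is handled separately below):

```python
import numpy as np
import pandas as pd

def drop_by_missingness(df, col_thresh=30.0, row_thresh=5.0):
    """Rule 1: drop columns with > col_thresh% missing values.
    Rule 2: drop rows that are missing in 'almost complete' columns
    (those with < row_thresh% missing). Imputation happens separately."""
    pct = df.isna().mean() * 100
    df = df.drop(columns=pct[pct > col_thresh].index)
    pct = df.isna().mean() * 100                 # recompute after the drop
    nearly_complete = pct[pct < row_thresh].index
    return df.dropna(subset=nearly_complete)

# Toy illustration: 'b' is 52% missing and gets dropped entirely;
# 'a' is 4% missing, so its one missing row is removed instead.
n = 25
a = np.arange(n, dtype=float); a[0] = np.nan
b = np.where(np.arange(n) % 2 == 0, np.nan, 1.0)
demo = pd.DataFrame({'a': a, 'b': b, 'c': np.ones(n)})
cleaned = drop_by_missingness(demo)
print(cleaned.shape)  # one column and one row fewer
```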
# --- 1. Check missingness per column (counts and %) ---
missing_counts = secom.isna().sum()
missing_percent = missing_counts / len(secom) * 100 # percent missing per column
print("Missing values per column:")
print(missing_counts)
print("\nNumber of columns with any missing values:",
(missing_counts > 0).sum())
Missing values per column:
Time 0
0 6
1 7
2 14
3 14
..
586 1
587 1
588 1
589 1
Pass/Fail 0
Length: 592, dtype: int64
Number of columns with any missing values: 538
# --- 2. Visualize distribution of % missing per column ---
plt.figure(figsize=(10, 6))
plt.hist(
missing_percent,
bins=20,
edgecolor='black'
)
plt.title('Distribution of % Missing per Column in SECOM Dataset', fontsize=14)
plt.xlabel('Percent Missing in a Column', fontsize=12)
plt.ylabel('Count of Columns', fontsize=12)
plt.grid(axis='y', alpha=0.5)
plt.show()
# --- 3. Feature removal: drop variables with >30% missing values ---
feature_thresh = 30 # percent
cols_high_missing = missing_percent[missing_percent > feature_thresh].index
print(f"\nNumber of columns dropped (>{feature_thresh}% missing): {len(cols_high_missing)}")
print("Columns dropped:", list(cols_high_missing))
secom = secom.drop(columns=cols_high_missing)
Number of columns dropped (>30% missing): 32 Columns dropped: ['72', '73', '85', '109', '110', '111', '112', '157', '158', '220', '244', '245', '246', '247', '292', '293', '345', '346', '358', '382', '383', '384', '385', '492', '516', '517', '518', '519', '578', '579', '580', '581']
# --- 4. Row removal: for variables with <5% missing, drop their missing rows ---
# Recompute missingness after dropping high-missing columns (optional, but clean)
missing_counts = secom.isna().sum()
missing_percent = missing_counts / len(secom) * 100
row_thresh = 5 # percent
cols_for_row_drop = missing_percent[missing_percent < row_thresh].index
print(f"\nColumns used for row-wise NA removal (<{row_thresh}% missing): {len(cols_for_row_drop)}")
print("Columns for row drop:", list(cols_for_row_drop))
# Drop rows where these "almost complete" columns have missing values
secom = secom.dropna(subset=cols_for_row_drop)
print("\nFinal shape after cleaning:", secom.shape)
Columns used for row-wise NA removal (<5% missing): 540 Columns for row drop: ['Time', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '150', '151', '152', '153', '154', '155', '156', '159', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169', '170', '171', '172', '173', '174', '175', '176', '177', '178', '179', '180', '181', '182', '183', '184', '185', '186', '187', '188', '189', '190', '191', '192', '193', '194', '195', '196', '197', '198', '199', '200', '201', '202', '203', '204', '205', '206', '207', '208', '209', '210', '211', '212', '213', '214', '215', '216', '217', '218', '219', '221', '222', '223', '224', '225', '226', '227', '228', '229', '230', '231', '232', '233', '234', '235', '236', '237', '238', '239', '240', '241', '242', '243', '248', '249', '250', '251', '252', '253', '254', '255', '256', '257', '258', '259', '260', '261', '262', '263', '264', '265', '266', '267', '268', '269', '270', '271', '272', '273', '274', '275', '276', '277', '278', '279', '280', '281', '282', '283', '284', '285', '286', '287', '288', '289', '290', '291', '294', '295', '296', '297', '298', '299', '300', '301', '302', '303', 
'304', '305', '306', '307', '308', '309', '310', '311', '312', '313', '314', '315', '316', '317', '318', '319', '320', '321', '322', '323', '324', '325', '326', '327', '328', '329', '330', '331', '332', '333', '334', '335', '336', '337', '338', '339', '340', '341', '342', '343', '344', '347', '348', '349', '350', '351', '352', '353', '354', '355', '356', '357', '359', '360', '361', '362', '363', '364', '365', '366', '367', '368', '369', '370', '371', '372', '373', '374', '375', '376', '377', '378', '379', '380', '381', '386', '387', '388', '389', '390', '391', '392', '393', '394', '395', '396', '397', '398', '399', '400', '401', '402', '403', '404', '405', '406', '407', '408', '409', '410', '411', '412', '413', '414', '415', '416', '417', '418', '419', '420', '421', '422', '423', '424', '425', '426', '427', '428', '429', '430', '431', '432', '433', '434', '435', '436', '437', '438', '439', '440', '441', '442', '443', '444', '445', '446', '447', '448', '449', '450', '451', '452', '453', '454', '455', '456', '457', '458', '459', '460', '461', '462', '463', '464', '465', '466', '467', '468', '469', '470', '471', '472', '473', '474', '475', '476', '477', '478', '479', '480', '481', '482', '483', '484', '485', '486', '487', '488', '489', '490', '491', '493', '494', '495', '496', '497', '498', '499', '500', '501', '502', '503', '504', '505', '506', '507', '508', '509', '510', '511', '512', '513', '514', '515', '520', '521', '522', '523', '524', '525', '526', '527', '528', '529', '530', '531', '532', '533', '534', '535', '536', '537', '538', '539', '540', '541', '542', '543', '544', '545', '558', '559', '560', '561', '570', '571', '572', '573', '574', '575', '576', '577', '582', '583', '584', '585', '586', '587', '588', '589', 'Pass/Fail'] Final shape after cleaning: (1393, 560)
cols_to_plot = ["10", "40"]
# Loop to generate plots for each specified column
for col in cols_to_plot:
    # Drop NaN values, as they cannot be plotted in a histogram or boxplot
    clean_data = secom[col].dropna()
    # Create a figure with two subplots side-by-side (1 row, 2 columns)
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    # 1. Histogram (Distribution Plot)
    axes[0].hist(clean_data, bins=30, edgecolor='black', color='teal', alpha=0.7)
    axes[0].set_title(f'Histogram of Feature {col} Values')
    axes[0].set_xlabel(f'Feature {col} Value')
    axes[0].set_ylabel('Frequency (Instance Count)')
    axes[0].grid(axis='y', linestyle='--', alpha=0.7)
    # 2. Boxplot (Distribution Summary)
    axes[1].boxplot(clean_data, patch_artist=True,
                    boxprops=dict(facecolor='lightcoral', color='black'),
                    medianprops=dict(color='black'))
    axes[1].set_title(f'Boxplot for Feature {col}')
    axes[1].set_ylabel(f'Feature {col} Value')
    axes[1].set_xticks([1], [f'Feature {col}'])  # Label the single box
    axes[1].grid(axis='y', linestyle='--', alpha=0.7)
    # Adjust layout and save the figure
    plt.tight_layout()
    plt.savefig(f'feature_{col}_hist_boxplot.png')
    plt.show()
    plt.close(fig)
Imputing a summary statistic¶
Check the columns that still have missing values
cols_with_missing_values = secom.columns[secom.isna().sum()>0]
print(cols_with_missing_values)
secom[cols_with_missing_values]
Index(['546', '547', '548', '549', '550', '551', '552', '553', '554', '555',
'556', '557', '562', '563', '564', '565', '566', '567', '568', '569'],
dtype='object')
| 546 | 547 | 548 | 549 | 550 | 551 | 552 | 553 | 554 | 555 | 556 | 557 | 562 | 563 | 564 | 565 | 566 | 567 | 568 | 569 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.3526 | 408.798 | 74.640 | 0.7193 | 16.00 | 1.33 | 0.2829 | 7.1196 | 0.4989 | 53.1836 | 3.9139 | 1.7819 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 0.7942 | 411.136 | 74.654 | 0.1832 | 16.16 | 0.85 | 0.0857 | 7.1619 | 0.3752 | 23.0713 | 3.9306 | 1.1386 | 267.064 | 0.9032 | 1.10 | 0.6219 | 0.4122 | 0.2562 | 0.4119 | 68.8489 |
| 3 | 1.1650 | 372.822 | 72.442 | 1.8804 | 131.68 | 39.33 | 0.6812 | 56.9303 | 17.4781 | 161.4081 | 35.3198 | 54.2917 | 268.228 | 0.6511 | 7.32 | 0.1630 | 3.5611 | 0.0670 | 2.7290 | 25.0363 |
| 4 | 1.4636 | 399.914 | 79.156 | 1.0388 | 19.63 | 1.98 | 0.4287 | 9.7608 | 0.8311 | 70.9706 | 4.9086 | 2.5014 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 1.2708 | 412.222 | 80.326 | 1.1050 | 16.06 | 1.65 | 0.5329 | 7.2448 | 0.6268 | 86.9463 | 3.8960 | 2.0541 | 254.006 | 0.8444 | 4.75 | 0.1905 | 1.8784 | 0.0743 | 1.8700 | 22.5598 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1537 | 0.8273 | 400.814 | 73.254 | 0.2235 | 14.53 | 1.33 | 0.0949 | 6.7381 | 0.5058 | 27.0176 | 3.6251 | 1.8156 | 270.038 | 0.7235 | 9.93 | 0.1420 | 3.9802 | 0.0579 | 3.6773 | 19.6342 |
| 1539 | 1.0795 | 404.148 | 73.766 | 1.0847 | 17.66 | 1.93 | 0.4177 | 8.2522 | 0.6858 | 100.4837 | 4.3697 | 2.6164 | 270.276 | 0.8724 | 8.79 | 0.2157 | 3.9604 | 0.0892 | 3.2522 | 24.7258 |
| 1540 | 1.1302 | 408.646 | 74.854 | 0.9679 | 10.91 | 0.76 | 0.4492 | 4.9069 | 0.2863 | 85.6474 | 2.6698 | 1.0153 | 248.652 | 0.7768 | 8.17 | 0.1185 | 3.0952 | 0.0493 | 3.2857 | 15.2498 |
| 1541 | 0.8273 | 400.814 | 73.254 | 0.2235 | 14.53 | 1.33 | 0.0949 | 6.7381 | 0.5058 | 27.0176 | 3.6251 | 1.8156 | 264.272 | 0.5671 | 4.98 | 0.0877 | 2.0902 | 0.0382 | 1.8844 | 15.4662 |
| 1550 | 0.5486 | 397.310 | 72.228 | 0.3441 | 19.10 | 1.39 | 0.1378 | 8.4989 | 0.5323 | 62.7106 | 4.8073 | 1.9245 | 264.272 | 0.5671 | 4.98 | 0.0877 | 2.0902 | 0.0382 | 1.8844 | 15.4662 |
1393 rows × 20 columns
Check the distribution of the variables that still have missing values:
- Normal distribution
- Skewed distribution
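As a simplified illustration of the mean-vs-median rule, the sketch below dispatches on a crude |skew| cutoff; the notebook itself uses the more formal Shapiro–Wilk and kurtosis tests in the next cell, so treat this as an assumption-laden stand-in.

```python
import numpy as np
import pandas as pd
from scipy import stats

def impute_by_shape(df, skew_cutoff=0.5):
    """Fill NaNs with the mean when |skew| < skew_cutoff (roughly symmetric),
    otherwise with the median. A crude proxy for formal normality testing."""
    out = df.copy()
    for col in out.columns:
        x = out[col].dropna()
        stat = x.mean() if abs(stats.skew(x)) < skew_cutoff else x.median()
        out[col] = out[col].fillna(stat)
    return out

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'symmetric': rng.normal(size=200),      # roughly normal -> mean
    'skewed': rng.exponential(size=200),    # right-skewed -> median
})
demo.iloc[:5, :] = np.nan
filled = impute_by_shape(demo)
```

The median is preferred for skewed variables because the mean is dragged toward the long tail, while the median stays near the bulk of the data.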
def check_normality_and_kurtosis(
    df,
    cols=None,
    alpha=0.05,
    shapiro_max=5000,   # Shapiro warning threshold (~5000); SECOM has n=1567 so this won't trigger
    min_n=8,            # minimum n to attempt Shapiro
    kurt_min_n=20,      # kurtosis test is reliable for n >= ~20
    random_state=0
):
    """
    For each numeric column (from `cols` or all df columns):
      - Drop NaNs, then run Shapiro–Wilk for normality (p-value >= alpha => 'normal').
      - Run the kurtosis test for departure from normal kurtosis (p-value < alpha => 'skewed' per the rule above).
    Returns (results_df, normal_vars, skewed_vars) and also prints the two lists.
    """
    cols_iter = list(df.columns) if cols is None else list(cols)
    rng = np.random.default_rng(random_state)
    rows = []
    for col in cols_iter:
        s = df[col]
        # Only numeric columns
        if not pd.api.types.is_numeric_dtype(s):
            continue
        x = s.dropna().to_numpy()
        n = x.size
        miss = s.isna().sum()
        # Skip tiny or constant series
        if n < min_n:
            rows.append(dict(col=col, n=n, missing=miss, note=f"skipped: n<{min_n}"))
            continue
        if np.isclose(np.nanstd(x), 0.0):
            rows.append(dict(col=col, n=n, missing=miss, note="skipped: constant"))
            continue
        # Descriptive skew (for direction only)
        skew_val = stats.skew(x, bias=False)
        # Shapiro–Wilk (cap to shapiro_max if ever needed)
        try:
            x_sh = x if n <= shapiro_max else rng.choice(x, size=shapiro_max, replace=False)
            sh_W, sh_p = stats.shapiro(x_sh)
        except Exception:
            sh_W, sh_p = np.nan, np.nan
        # Kurtosis test (excess kurtosis == 0 under normal)
        kurt_z, kurt_p = np.nan, np.nan
        if n >= kurt_min_n:
            try:
                kurt_z, kurt_p = stats.kurtosistest(x)
            except Exception:
                pass
        rows.append({
            "col": col,
            "n": n,
            "missing": miss,
            "skew": skew_val,   # descriptive (sign gives left/right)
            "shapiro_W": sh_W,
            "shapiro_p": sh_p,
            "kurt_z": kurt_z,
            "kurt_p": kurt_p,
            "note": ""
        })
    res = pd.DataFrame(rows).set_index("col").sort_index()
    # Classifications per the rule above
    normal_mask = res["shapiro_p"] >= alpha
    skewed_mask = res["kurt_p"] < alpha
    normal_vars = res.index[normal_mask.fillna(False)].tolist()
    skewed_vars = res.index[skewed_mask.fillna(False)].tolist()
    # ---- Print summaries ----
    print(f"Normally distributed by Shapiro–Wilk (p ≥ {alpha}): {len(normal_vars)}")
    print(normal_vars)
    print(f"\n'Skewed' by kurtosis test (p < {alpha}): {len(skewed_vars)}")
    print(skewed_vars)
    # Optional: show skew direction for the kurtosis-flagged vars
    if len(skewed_vars) > 0:
        dir_tbl = res.loc[skewed_vars, ["skew", "kurt_z", "kurt_p"]].sort_values("skew")
        print("\nSkew direction (negative = left, positive = right) for kurtosis-flagged vars:")
        print(dir_tbl)
    return res, normal_vars, skewed_vars

# ---------- Usage on SECOM: only variables with missing values ----------
cols_with_missing_values = secom.columns[secom.isna().sum() > 0]
results, normal_vars, skewed_vars = check_normality_and_kurtosis(
    secom,
    cols=cols_with_missing_values,
    alpha=0.05
)
Normally distributed by Shapiro–Wilk (p ≥ 0.05): 0
[]
'Skewed' by kurtosis test (p < 0.05): 20
['546', '547', '548', '549', '550', '551', '552', '553', '554', '555', '556', '557', '562', '563', '564', '565', '566', '567', '568', '569']
Skew direction (negative = left, positive = right) for kurtosis-flagged vars:
skew kurt_z kurt_p
col
547 -0.444275 6.407610 1.478182e-10
562 0.133034 8.731383 2.515772e-18
548 0.849374 -8.559926 1.129400e-17
563 0.875335 3.499987 4.652811e-04
546 1.250570 8.887123 6.271149e-19
555 1.715929 12.157176 5.254390e-34
564 1.812195 14.132145 2.407008e-45
568 1.868507 14.660421 1.155540e-48
565 2.110839 13.034970 7.739770e-39
566 2.115757 15.807597 2.757860e-56
569 2.117241 12.724327 4.331909e-37
567 2.226898 13.519533 1.199349e-41
552 3.665908 17.340619 2.322215e-67
549 3.837846 17.678926 6.094651e-70
553 11.068557 23.415281 2.986535e-121
550 12.128992 23.610078 3.036858e-123
556 13.512496 23.874108 5.690932e-126
551 20.480947 24.640177 4.690008e-134
557 20.972181 24.678778 1.807598e-134
554 20.983550 24.692628 1.283435e-134
Conclusion: none of the 20 variables with missing values passes the normality test; all 20 have skewed distributions.
Next, fill the missing values of these skewed variables with their median.
num_cols = [c for c in skewed_vars if pd.api.types.is_numeric_dtype(secom[c])]
medians = secom[num_cols].median() # Series of medians per column
secom[num_cols] = secom[num_cols].fillna(medians) # column-wise fill
# -1 is Pass, 1 is fail
secom["Pass/Fail"].value_counts()
| count | |
|---|---|
| Pass/Fail | |
| -1 | 1294 |
| 1 | 99 |
secom.describe()
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 577 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | Pass/Fail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.0 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | ... | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 |
| mean | 3014.966274 | 2494.861981 | 2200.348900 | 1384.417093 | 2.928322 | 100.0 | 101.231608 | 0.122246 | 1.465172 | -0.000861 | ... | 16.434664 | 0.500049 | 0.015366 | 0.003858 | 3.078235 | 0.021743 | 0.016454 | 0.005279 | 98.088611 | -0.857861 |
| std | 74.195555 | 81.555556 | 29.825001 | 415.755060 | 42.128656 | 0.0 | 5.900409 | 0.004998 | 0.072696 | 0.015069 | ... | 12.336507 | 0.003426 | 0.018124 | 0.003930 | 3.776211 | 0.012602 | 0.008801 | 0.002865 | 92.838412 | 0.514067 |
| min | 2743.240000 | 2158.750000 | 2060.660000 | 0.000000 | 0.681500 | 100.0 | 82.131100 | 0.000000 | 1.191000 | -0.053400 | ... | 4.582000 | 0.477800 | 0.006000 | 0.001700 | 1.197500 | -0.006000 | 0.003200 | 0.001000 | 0.000000 | -1.000000 |
| 25% | 2966.240000 | 2450.890000 | 2180.966600 | 1084.377900 | 1.016000 | 100.0 | 98.131100 | 0.121100 | 1.412500 | -0.010800 | ... | 11.280000 | 0.497900 | 0.011600 | 0.003100 | 2.305500 | 0.013600 | 0.010600 | 0.003300 | 44.175400 | -1.000000 |
| 50% | 3011.840000 | 2498.770000 | 2201.066700 | 1285.214400 | 1.307600 | 100.0 | 101.666700 | 0.122400 | 1.462900 | -0.001300 | ... | 13.733200 | 0.500100 | 0.013800 | 0.003600 | 2.760200 | 0.020800 | 0.014800 | 0.004600 | 71.084200 | -1.000000 |
| 75% | 3057.450000 | 2538.740000 | 2218.577800 | 1588.509000 | 1.518800 | 100.0 | 104.586700 | 0.123800 | 1.518600 | 0.008500 | ... | 17.002800 | 0.502300 | 0.016400 | 0.004100 | 3.284400 | 0.027700 | 0.020300 | 0.006400 | 113.550600 | -1.000000 |
| max | 3356.350000 | 2846.440000 | 2315.266700 | 3715.041700 | 1114.536600 | 100.0 | 129.252200 | 0.128600 | 1.656400 | 0.074900 | ... | 96.960100 | 0.509800 | 0.476600 | 0.104500 | 99.303200 | 0.102800 | 0.079900 | 0.028600 | 737.304800 | 1.000000 |
8 rows × 559 columns
Applying clustering method for the original data¶
Data Preparation and Scaling for the cluster¶
# Copy the dataset
secom_upd = secom.copy()
# 1. Separate features (X) from the rest
# Drop the target and time column, and all other non-feature columns
feature_cols = [col for col in secom_upd.columns if col not in ["Time", "Pass/Fail"]]
X = secom_upd[feature_cols]
# Ensure all selected features are numeric (they should be after your cleaning steps)
X = X.select_dtypes(include=np.number)
# 2. Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Convert back to DataFrame for easier inspection
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
# Assign a variable for the number of rows/samples
n_samples = X_scaled_df.shape[0]
# Print shape to confirm
print(f"Original feature shape: {X.shape}")
print(f"Scaled feature shape: {X_scaled_df.shape}")
# 3. Count outliers per column: |z| > 3
z_thresh = 3
# Use axis=0 to get counts per column, then wrap in a Series with column names
outlier_counts = pd.Series(
(np.abs(X_scaled) > z_thresh).sum(axis=0),
index=X.columns
)
print("\nOutliers per variable (|z| > 3):")
print(outlier_counts.sort_values(ascending=False).head(10))
Original feature shape: (1393, 558) Scaled feature shape: (1393, 558) Outliers per variable (|z| > 3): 38 68 576 62 574 60 572 54 577 53 573 52 558 49 575 48 160 48 295 48 dtype: int64
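Separately from outliers, the `describe()` output earlier shows at least one constant column (feature `5` has std 0). Such columns become all zeros under StandardScaler and carry no information for clustering, so one hedged option is to drop zero-variance columns before scaling, sketched here on a toy frame with `VarianceThreshold`:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy illustration: 'const' has zero variance and is removed.
demo = pd.DataFrame({'const': [100.0] * 5,
                     'varies': [1.0, 2.0, 3.0, 4.0, 5.0]})
selector = VarianceThreshold(threshold=0.0)  # drop zero-variance columns
kept = selector.fit_transform(demo)
kept_cols = demo.columns[selector.get_support()]
print(f"kept columns: {list(kept_cols)}")
```

Applied to the SECOM feature matrix before `StandardScaler`, this would remove dead sensors in one step rather than carrying them through scaling and clustering.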
cols_to_plot = ["38", "576", "574", "572", "577", "573"]
# Loop to generate plots for each specified column
for col in cols_to_plot:
    # Drop NaN values, as they cannot be plotted in a histogram or boxplot
    clean_data = secom[col].dropna()
    # Create a figure with two subplots side-by-side (1 row, 2 columns)
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    # 1. Histogram (Distribution Plot)
    axes[0].hist(clean_data, bins=30, edgecolor='black', color='teal', alpha=0.7)
    axes[0].set_title(f'Histogram of Feature {col} Values')
    axes[0].set_xlabel(f'Feature {col} Value')
    axes[0].set_ylabel('Frequency (Instance Count)')
    axes[0].grid(axis='y', linestyle='--', alpha=0.7)
    # 2. Boxplot (Distribution Summary)
    axes[1].boxplot(clean_data, patch_artist=True,
                    boxprops=dict(facecolor='lightcoral', color='black'),
                    medianprops=dict(color='black'))
    axes[1].set_title(f'Boxplot for Feature {col}')
    axes[1].set_ylabel(f'Feature {col} Value')
    axes[1].set_xticks([1], [f'Feature {col}'])  # Label the single box
    axes[1].grid(axis='y', linestyle='--', alpha=0.7)
    # Adjust layout and save the figure
    plt.tight_layout()
    plt.savefig(f'feature_{col}_hist_boxplot.png')
    plt.show()
    plt.close(fig)
X_reset = X.reset_index(drop=True)
X_reset = np.array(X_reset)
X_reset
array([[3.0957800e+03, 2.4651400e+03, 2.2304222e+03, ..., 2.0100000e-02,
6.0000000e-03, 2.0820450e+02],
[2.9326100e+03, 2.5599400e+03, 2.1864111e+03, ..., 4.8400000e-02,
1.4800000e-02, 8.2860200e+01],
[2.9887200e+03, 2.4799000e+03, 2.1990333e+03, ..., 1.4900000e-02,
4.4000000e-03, 7.3843200e+01],
...,
[2.9960400e+03, 2.5559200e+03, 2.1907666e+03, ..., 1.2300000e-02,
4.6000000e-03, 6.7667600e+01],
[3.2463100e+03, 2.4997900e+03, 2.2168111e+03, ..., 6.3000000e-03,
1.9000000e-03, 2.3597900e+01],
[3.0722000e+03, 2.4064700e+03, 2.1954444e+03, ..., 6.5000000e-03,
2.2000000e-03, 2.7551400e+01]])
SMOTE Logistic Regression¶
# --- ASSUMED DATA PREPARATION ---
# X_full is the full cleaned feature set
X_full = X_reset.copy()
# y_arr is the binary target array (Pass=-1 -> 0, Fail=1 -> 1)
y_arr = (secom_upd['Pass/Fail'] == 1).astype(int).values
RANDOM_STATE = 42
THRESHOLD = 0.5
STAGE_NAME = "1st Stage: Full Cleaned Features"
print(f"======== {STAGE_NAME} ({X_full.shape[1]} features) ========")
# 1. Split Data (Stratified)
X_train, X_test, y_train, y_test = train_test_split(
X_full, y_arr, test_size=0.3, random_state=RANDOM_STATE, stratify=y_arr
)
# 2. Preprocessing Pipeline: Impute (Median) -> Scale
preprocessor = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
X_train_scaled = preprocessor.fit_transform(X_train)
X_test_scaled = preprocessor.transform(X_test)
# 3. SMOTE on Training Data
smote = SMOTE(sampling_strategy='minority', random_state=RANDOM_STATE)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)
print(f"Train size: {len(y_train)} -> Resampled size: {len(y_train_resampled)}")
# 4. Logistic Regression Model
model = LogisticRegression(solver='liblinear', penalty='l2', random_state=RANDOM_STATE)
model.fit(X_train_resampled, y_train_resampled)
# 5. Prediction and Evaluation (on Test Set)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
y_pred = (y_pred_proba >= THRESHOLD).astype(int)
# Metrics
f1 = f1_score(y_test, y_pred)
auprc = average_precision_score(y_test, y_pred_proba)
cm = confusion_matrix(y_test, y_pred)
print("\n--- Performance Metrics (Test Set) ---")
print(f"F1-score: {f1:.4f}")
print(f"AUPRC: {auprc:.4f}")
# 6. Plotting: Confusion Matrix & PR Curve
plt.figure(figsize=(12, 5))
# Confusion Matrix
plt.subplot(1, 2, 1)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Pass (0)', 'Fail (1)'])
disp.plot(ax=plt.gca(), cmap='Blues', values_format='d')
plt.title(f"Confusion Matrix\n({STAGE_NAME})")
plt.grid(False)
# Precision-Recall Curve
p, r, _ = precision_recall_curve(y_test, y_pred_proba)
baseline = np.sum(y_test) / len(y_test)
plt.subplot(1, 2, 2)
plt.plot(r, p, label=f'Test Set (AUPRC={auprc:.4f})')
plt.plot([0, 1], [baseline, baseline], linestyle='--', color='red', label=f'Baseline (AP={baseline:.4f})')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve (Test Set)")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
======== 1st Stage: Full Cleaned Features (558 features) ======== Train size: 975 -> Resampled size: 1812 --- Performance Metrics (Test Set) --- F1-score: 0.2222 AUPRC: 0.1491
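The fixed THRESHOLD = 0.5 used above is rarely optimal under class imbalance. As a hedged sketch (the y_true/y_proba arrays below are simulated stand-ins for y_test and y_pred_proba), the threshold that maximizes F1 can be read off the precision-recall curve:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_proba):
    """Return (threshold, f1) maximizing F1 along the PR curve."""
    p, r, thresholds = precision_recall_curve(y_true, y_proba)
    # precision_recall_curve returns len(thresholds) + 1 PR points; drop the last
    f1 = 2 * p[:-1] * r[:-1] / np.clip(p[:-1] + r[:-1], 1e-12, None)
    i = np.argmax(f1)
    return thresholds[i], f1[i]

# Simulated scores standing in for y_test / y_pred_proba above
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, 500)
y_proba = np.clip(y_true * 0.3 + rng.uniform(0, 0.7, 500), 0, 1)
thr, f1 = best_f1_threshold(y_true, y_proba)
print(f"Best threshold: {thr:.3f}, F1: {f1:.3f}")
```

On the real test set this typically moves the operating point toward higher recall on the minority Fail class than the default 0.5 cutoff.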
KMEANS METHOD¶
Using the elbow method to find the optimal number of clusters¶
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
kmeans.fit(X_reset)
wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
The plot shows an elbow, or "bend," at k = 3 or 4 clusters.
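Since the elbow here is ambiguous (k = 3 or 4, while k = 2 is trained below), the silhouette score is a common cross-check. A minimal sketch on synthetic blobs standing in for X_reset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_reset: 3 well-separated blobs
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10,
                    random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")  # expected 3 for well-separated blobs
```

The k with the highest average silhouette is the preferred cluster count; on real SECOM features the peak is usually much less pronounced than on toy blobs.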
Training the K-means model on the dataset¶
kmeans = KMeans(n_clusters = 2, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X_reset)
pd.Series(y_kmeans).value_counts()
| y_kmeans | count |
|---|---|
| 1 | 1101 |
| 0 | 292 |
Visualising the clusters¶
plt.scatter(X_reset[y_kmeans == 0, 0], X_reset[y_kmeans == 0, 1], s = 15, c = 'lightskyblue', label = 'Cluster 1')
plt.scatter(X_reset[y_kmeans == 1, 0], X_reset[y_kmeans == 1, 1], s = 15, c = 'orange', label = 'Cluster 2')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 20, c = 'red', label = 'Centroids')
plt.title('K-Means clusters (first two features)')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.legend()
plt.show()
Combine t-SNE and Kmeans¶
Apply a dimensionality-reduction technique such as t-SNE to project the data into a lower-dimensional space before plotting the K-means clusters
# Assuming X_scaled_df is your high-dimensional, scaled feature data
# Assuming y_kmeans contains the cluster labels from your K-Means run
# --- Step 1: Apply 2D t-SNE ---
# n_components=2 for 2D visualization
tsne_2d = TSNE(n_components=2, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
X_tsne_2d = tsne_2d.fit_transform(X_scaled_df)
# Convert to DataFrame for easier plotting
X_tsne_df = pd.DataFrame(data = X_tsne_2d, columns = ['tSNE_1', 'tSNE_2'])
X_tsne_df['KMeans_Cluster'] = y_kmeans # Add cluster labels
# --- Step 2: Plot the 2D t-SNE results ---
plt.figure(figsize=(8, 6))
for cluster in sorted(X_tsne_df['KMeans_Cluster'].unique()):
subset = X_tsne_df[X_tsne_df['KMeans_Cluster'] == cluster]
plt.scatter(
subset['tSNE_1'],
subset['tSNE_2'],
s = 10,
label = f'Cluster {cluster}'
)
plt.title('K-Means Clusters Visualized by 2D t-SNE')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.legend()
plt.show()
# Assuming X_scaled_df and y_kmeans are ready
# --- Step 1: Apply 3D t-SNE ---
tsne_3d = TSNE(n_components=3, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
X_tsne_3d = tsne_3d.fit_transform(X_scaled_df)
# Convert to DataFrame
X_tsne_df_3d = pd.DataFrame(data = X_tsne_3d, columns = ['tSNE_1', 'tSNE_2', 'tSNE_3'])
X_tsne_df_3d['KMeans_Cluster'] = y_kmeans
# Define the color for the clusters
cluster_colors = {
0: 'lightskyblue',
1: 'orange',
}
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
for cluster_label, color_name in cluster_colors.items():
subset = X_tsne_df_3d[X_tsne_df_3d['KMeans_Cluster'] == cluster_label]
ax.scatter(
subset['tSNE_1'],
subset['tSNE_2'],
subset['tSNE_3'],
s=10,
c=color_name, # <<< Use the specific color here
label=f'Cluster {cluster_label}',
alpha=0.7
)
ax.set_title('K-Means Clusters Visualized by 3D t-SNE')
ax.set_xlabel('t-SNE Component 1')
ax.set_ylabel('t-SNE Component 2')
ax.set_zlabel('t-SNE Component 3')
ax.legend()
plt.show()
# Assuming X_tsne_df_3d is available and contains the data.
# Note: You'll need to run your data preparation steps first
# (TSNE n_components=3 and K-Means) to populate this DataFrame.
# Ensure the cluster column is treated as a string/categorical for distinct coloring
X_tsne_df_3d['KMeans_Cluster_Str'] = X_tsne_df_3d['KMeans_Cluster'].astype(str)
# Define the color of the Kmeans cluster
specific_plotly_colors = {
'0': 'lightskyblue',
'1': 'orange',
}
# Create the interactive 3D scatter plot using Plotly Express
fig = px.scatter_3d(
X_tsne_df_3d,
x='tSNE_1',
y='tSNE_2',
z='tSNE_3',
color='KMeans_Cluster_Str', # Color points by cluster
color_discrete_map=specific_plotly_colors, # <<< Assign specific colors
title='Interactive 3D K-Means Clusters (Plotly)',
opacity=0.7, # Transparency level
size_max=5, # <<< This controls the dot size (max diameter in pixels)
labels={'tSNE_1': 't-SNE Component 1',
'tSNE_2': 't-SNE Component 2',
'tSNE_3': 't-SNE Component 3'}
)
# You can further customize the marker size and styling if needed:
fig.update_traces(marker=dict(size=3)) # <<< Sets a fixed size for all markers
fig.show() # This displays the rotatable plot in Google Colab
t-SNE¶
# High number of features (590+) requires dimensionality reduction for visualization.
# Note: t-SNE is computationally expensive; reduce n_components to 2 for visualization.
# Run t-SNE on the scaled data
# n_components=2 for 2D visualization
tsne = TSNE(n_components=2, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
X_tsne = tsne.fit_transform(X_scaled_df)
# Store the t-SNE results in the DataFrame
secom_upd['tSNE_1'] = X_tsne[:, 0]
secom_upd['tSNE_2'] = X_tsne[:, 1]
print("t-SNE reduction complete. Results stored in 'tSNE_1' and 'tSNE_2'.")
t-SNE reduction complete. Results stored in 'tSNE_1' and 'tSNE_2'.
DBSCAN and t-SNE¶
# 1. --- FIT DBSCAN MODEL ---
# NOTE: The parameters (eps, min_samples) must be tuned for your SECOM data.
# A common starting point for min_samples is 2 * number of dimensions or ln(n_samples).
min_samples_start = 200 #int(np.log(X_scaled_df.shape[0]))
eps_start = 2 # Start with a large value and tune down, or use a k-distance plot.
dbscan = DBSCAN(eps=eps_start, min_samples=min_samples_start)
secom_clusters = dbscan.fit_predict(X_scaled_df)
print(f"DBSCAN Cluster Labels: {np.unique(secom_clusters)}")
print(f"Cluster Distribution (Tuning Required):\n{pd.Series(secom_clusters).value_counts()}")
# 2. --- PREPARE FOR VISUALIZATION (using t-SNE) ---
# DBSCAN was fit on the high-dimensional data, but we need 2D data for a plot.
# We will use t-SNE for dimensionality reduction for visualization.
tsne_2d = TSNE(n_components=2, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
X_tsne_2d = tsne_2d.fit_transform(X_scaled_df)
# Create a DataFrame for plotting
df_secom = pd.DataFrame(X_tsne_2d, columns=['tSNE_1', 'tSNE_2'])
df_secom['Cluster'] = secom_clusters
# 3. --- PLOT THE DBSCAN CLUSTERS (using t-SNE components) ---
plt.figure(figsize=(10, 8))
plt.scatter(
df_secom['tSNE_1'],
df_secom['tSNE_2'],
c=df_secom['Cluster'],
cmap='viridis', # 'viridis' often works well for many colors
s=10, # Minimized dot size
alpha=0.7
)
plt.title(f'DBSCAN Clusters on SECOM Data (Visualized with 2D t-SNE)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.colorbar(label='Cluster Label (-1 is Noise)')
plt.show()
DBSCAN Cluster Labels: [-1] Cluster Distribution (Tuning Required): -1 1393 Name: count, dtype: int64
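The run above labels every point as noise (-1), which suggests eps = 2 is far too small for min_samples = 200. A k-distance plot is the standard way to pick eps: sort each point's distance to its k-th nearest neighbor and look for the knee. A sketch on synthetic data standing in for X_scaled_df:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=500, centers=2, cluster_std=1.0, random_state=42)
k = 10  # would be min_samples on the real data

nn = NearestNeighbors(n_neighbors=k).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_dist = np.sort(distances[:, -1])  # distance to the k-th nearest point (self included), ascending

# Crude knee estimate: point of maximum curvature via the second difference
knee_idx = np.argmax(np.diff(k_dist, 2)) + 1
eps_suggestion = k_dist[knee_idx]
print(f"Suggested eps ~= {eps_suggestion:.3f}")
```

In practice the sorted curve is plotted and the eps is read off visually at the bend; the second-difference estimate here is only a rough automation of that step.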
Autoencoder for the original dataset¶
# =========================================================================
# 1. DATA PREPARATION (SECOM specific changes)
# =========================================================================
scaler = StandardScaler()  # assumes X is the cleaned feature matrix from the earlier cells
X_secom_scaled_array = scaler.fit_transform(X)
y_secom_target = secom_upd['Pass/Fail'].values
# NOTE: Load your cleaned, scaled SECOM feature array (X_secom_scaled_array)
# and the target array (y_secom_target) here.
# We estimate the input dimension based on your previous steps (~590 features).
INPUT_DIM = X_secom_scaled_array.shape[1]
LATENT_DIM = 10
print(f"Detected Input Dimension from data: {INPUT_DIM}") # Check the output!
# Convert to tensor and create DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
X_tensor = torch.tensor(X_secom_scaled_array, dtype=torch.float32).to(device)
dataset = torch.utils.data.TensorDataset(X_tensor)
loader = torch.utils.data.DataLoader(dataset, batch_size=256, shuffle=True)
# =========================================================================
# 2. MODEL DEFINITION (Scaled for High Dimensionality)
# =========================================================================
class DeepAutoencoderSECOM(nn.Module):
def __init__(self, input_dim=INPUT_DIM, latent_dim=LATENT_DIM):
super(DeepAutoencoderSECOM, self).__init__()
# Scaling layers for 590+ features, creating a wider funnel
hidden_1 = 512
hidden_2 = 128
hidden_3 = 64
# Encoder: 4 layers funneling down to the latent bottleneck
self.encoder = nn.Sequential(
nn.Linear(input_dim, hidden_1), # input_dim (558) -> 512
nn.ReLU(),
nn.Linear(hidden_1, hidden_2), # 512 -> 128
nn.ReLU(),
nn.Linear(hidden_2, hidden_3), # 128 -> 64
nn.ReLU(),
nn.Linear(hidden_3, latent_dim) # 64 -> latent_dim (10) bottleneck
)
# Decoder: 4 layers reconstructing the input
self.decoder = nn.Sequential(
nn.Linear(latent_dim, hidden_3), # latent_dim (10) -> 64
nn.ReLU(),
nn.Linear(hidden_3, hidden_2), # 64 -> 128
nn.ReLU(),
nn.Linear(hidden_2, hidden_1), # 128 -> 512
nn.ReLU(),
nn.Linear(hidden_1, input_dim) # 512 -> input_dim (558)
)
def forward(self, x):
z = self.encoder(x)
x_hat = self.decoder(z)
return z, x_hat
def encode(self, x):
"""Method to explicitly get the latent representation Z."""
return self.encoder(x)
# =========================================================================
# 3. TRAINING & PLOTTING
# =========================================================================
# Initialize model, optimizer, and loss function
model = DeepAutoencoderSECOM(input_dim=INPUT_DIM, latent_dim=LATENT_DIM).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
# Training loop
epochs = 50 # Reduced epochs for faster execution/testing. Increase this for final training.
model.train()
print(f"Starting training on {INPUT_DIM} features with {epochs} epochs...")
for epoch in range(epochs):
total_loss = 0
for batch in loader:
batch_x = batch[0]
optimizer.zero_grad()
z, x_hat = model(batch_x)
loss = criterion(x_hat, batch_x)
loss.backward()
optimizer.step()
total_loss += loss.item()
avg_loss = total_loss / len(loader)
if (epoch + 1) % 5 == 0:
print(f"Epoch {epoch+1}/{epochs}, Loss={avg_loss:.6f}")
# Extract latent representation from encoder
model.eval()
with torch.no_grad():
Z, _ = model(X_tensor)
Z_np = Z.cpu().numpy()
# Plot the first two of the LATENT_DIM latent dimensions, colored by the Pass/Fail target
plt.figure(figsize=(8, 7))
# Use the original SECOM labels (y_secom_target) for coloring:
# -1 = Pass, 1 = Fail (the SECOM label convention)
# A diverging cmap keeps the two classes visually distinct
plt.scatter(Z_np[:, 0], Z_np[:, 1], c=y_secom_target,
cmap='coolwarm', s=10, alpha=0.7)
plt.title(f"SECOM Data - Deep Autoencoder Latent Space (first 2 of {LATENT_DIM} dims)")
plt.xlabel("Latent Component 1")
plt.ylabel("Latent Component 2")
plt.colorbar(label="Pass/Fail Label (-1 or 1)")
plt.grid(True)
plt.show()
Detected Input Dimension from data: 558 Starting training on 558 features with 50 epochs... Epoch 5/50, Loss=0.703927 Epoch 10/50, Loss=0.647051 Epoch 15/50, Loss=0.613516 Epoch 20/50, Loss=0.564242 Epoch 25/50, Loss=0.527392 Epoch 30/50, Loss=0.496396 Epoch 35/50, Loss=0.463096 Epoch 40/50, Loss=0.444802 Epoch 45/50, Loss=0.423758 Epoch 50/50, Loss=0.404138
# =========================================================================
# 4. Calculate Final Reconstruction Error on the entire dataset
# =========================================================================
model.eval()
with torch.no_grad():
# Pass the entire dataset (X_tensor) through the model
_, X_hat_final = model(X_tensor)
# Calculate the MSE loss between the final reconstruction and the original data
final_reconstruction_error = criterion(X_hat_final, X_tensor).item()
print(f"\nFinal Reconstruction MSE Error on Full Dataset: {final_reconstruction_error:.6f}")
Final Reconstruction MSE Error on Full Dataset: 0.398797
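Beyond this single aggregate MSE, the per-sample reconstruction error is the usual autoencoder anomaly score: rows the model reconstructs poorly are candidate failures. A sketch with numpy arrays standing in for X_tensor and X_hat_final:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                    # stand-in for the original scaled data
X_hat = X + rng.normal(scale=0.1, size=X.shape)   # stand-in for the reconstruction
X_hat[:5] += 2.0                                  # make the first 5 rows reconstruct badly

# Mean squared error per row
per_sample_mse = ((X - X_hat) ** 2).mean(axis=1)

# Flag rows above the 95th percentile as anomalies
threshold = np.quantile(per_sample_mse, 0.95)
anomalies = np.where(per_sample_mse > threshold)[0]
print("Flagged rows:", anomalies)  # the 5 corrupted rows
```

On the real model this would be `((X_tensor - X_hat_final) ** 2).mean(dim=1)` on the torch tensors; the percentile cutoff is a tunable assumption, not a property of the data.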
2. Feature Selection¶
Low-variance feature¶
Variables with low variance carry little predictive information, so I decided to remove all of them.
secom_upd = secom.copy()
# Find near-constant variables, excluding the target column "Pass/Fail"
near_const = []
for c in secom_upd.columns:
if c == "Pass/Fail":
continue # don't consider the label column
vc = secom_upd[c].value_counts(dropna=False, normalize=True)
if not vc.empty and vc.iloc[0] >= 0.90:
near_const.append(c)
print("Number of variables where ≥90% of values are identical (excluding 'Pass/Fail'):", len(near_const))
print("Example of near-constant variables:", near_const[:10])
print(secom_upd[near_const].value_counts())
# Remove low-variance variables, but keep "Pass/Fail"
secom_reduced = secom_upd.drop(columns=near_const)
print("Original shape:", secom_upd.shape)
print("Reduced shape:", secom_reduced.shape)
secom_reduced.describe()
Number of variables where ≥90% of values are identical (excluding 'Pass/Fail'): 126
Example of near-constant variables: ['5', '13', '42', '49', '52', '69', '74', '97', '114', '141']
(Wide value_counts output over the 126 near-constant columns omitted: the dominant value combination accounts for 1,371 of the 1,393 rows, and each remaining combination appears only once.)
Original shape: (1393, 560)
Reduced shape: (1393, 434)
| 0 | 1 | 2 | 3 | 4 | 6 | 7 | 8 | 9 | 10 | ... | 577 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | Pass/Fail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | ... | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 | 1393.000000 |
| mean | 3014.966274 | 2494.861981 | 2200.348900 | 1384.417093 | 2.928322 | 101.231608 | 0.122246 | 1.465172 | -0.000861 | 0.000021 | ... | 16.434664 | 0.500049 | 0.015366 | 0.003858 | 3.078235 | 0.021743 | 0.016454 | 0.005279 | 98.088611 | -0.857861 |
| std | 74.195555 | 81.555556 | 29.825001 | 415.755060 | 42.128656 | 5.900409 | 0.004998 | 0.072696 | 0.015069 | 0.009343 | ... | 12.336507 | 0.003426 | 0.018124 | 0.003930 | 3.776211 | 0.012602 | 0.008801 | 0.002865 | 92.838412 | 0.514067 |
| min | 2743.240000 | 2158.750000 | 2060.660000 | 0.000000 | 0.681500 | 82.131100 | 0.000000 | 1.191000 | -0.053400 | -0.034900 | ... | 4.582000 | 0.477800 | 0.006000 | 0.001700 | 1.197500 | -0.006000 | 0.003200 | 0.001000 | 0.000000 | -1.000000 |
| 25% | 2966.240000 | 2450.890000 | 2180.966600 | 1084.377900 | 1.016000 | 98.131100 | 0.121100 | 1.412500 | -0.010800 | -0.005700 | ... | 11.280000 | 0.497900 | 0.011600 | 0.003100 | 2.305500 | 0.013600 | 0.010600 | 0.003300 | 44.175400 | -1.000000 |
| 50% | 3011.840000 | 2498.770000 | 2201.066700 | 1285.214400 | 1.307600 | 101.666700 | 0.122400 | 1.462900 | -0.001300 | 0.000400 | ... | 13.733200 | 0.500100 | 0.013800 | 0.003600 | 2.760200 | 0.020800 | 0.014800 | 0.004600 | 71.084200 | -1.000000 |
| 75% | 3057.450000 | 2538.740000 | 2218.577800 | 1588.509000 | 1.518800 | 104.586700 | 0.123800 | 1.518600 | 0.008500 | 0.005800 | ... | 17.002800 | 0.502300 | 0.016400 | 0.004100 | 3.284400 | 0.027700 | 0.020300 | 0.006400 | 113.550600 | -1.000000 |
| max | 3356.350000 | 2846.440000 | 2315.266700 | 3715.041700 | 1114.536600 | 129.252200 | 0.128600 | 1.656400 | 0.074900 | 0.053000 | ... | 96.960100 | 0.509800 | 0.476600 | 0.104500 | 99.303200 | 0.102800 | 0.079900 | 0.028600 | 737.304800 | 1.000000 |
8 rows × 433 columns
Multicollinearity - Filter by correlation test¶
- The Pearson correlation coefficient tests for linear associations.
- Spearman's rank coefficient tests for monotonic (possibly non-linear) associations.
- Are there any monotonic relationships? (In a monotonic relationship, the variables tend to move in the same relative direction, but not necessarily at a constant rate.)
# Subset to a dataframe containing only the feature (X) variables
df = secom_reduced.drop(columns=["Time", "Pass/Fail"], errors="ignore")
df = df.select_dtypes(include=[np.number]).copy()
df = df.fillna(df.median(numeric_only=True))
r_thresh = 0.95 # linear redundancy cutoff
rho_thresh = 0.95 # monotonic redundancy cutoff
linear_guard = 0.85 # only treat as "nonlinear" if |Pearson| < 0.85
# ---- Stage 1: Pearson (linear) ----
R = df.corr(method="pearson").abs()
upper = R.where(np.triu(np.ones(R.shape), k=1).astype(bool))
drop1 = [c for c in upper.columns if (upper[c] >= r_thresh).any()]
keep1 = [c for c in df.columns if c not in drop1]
df1 = df[keep1]
print(f"[Stage 1] {df.shape[1]} → {df1.shape[1]} after linear prune")
# ---- Stage 2: Spearman (monotonic but not strongly linear) ----
if df1.shape[1] > 1:
R1 = df1.corr(method="pearson").abs()
S1 = df1.corr(method="spearman").abs()
# We only consider pairs that are high-Spearman and low-Pearson
mask = (S1 >= rho_thresh) & (R1 < linear_guard)
mask_upper = mask.where(np.triu(np.ones(mask.shape), k=1).astype(bool))
drop2 = [c for c in mask_upper.columns if mask_upper[c].any()]
else:
drop2 = []
keep_final = [c for c in keep1 if c not in drop2]
df_pruned = df[keep_final]
print(f"[Stage 2] {df1.shape[1]} → {df_pruned.shape[1]} after monotonic prune")
selected_cols = keep_final
[Stage 1] 432 → 268 after linear prune [Stage 2] 268 → 263 after monotonic prune
# --- Easy report: just the names ---
print("\n[Stage 1] Linear prune (|Pearson r| ≥", r_thresh, ")")
print("Dropped:", len(drop1))
print(drop1) # list of column names dropped in Stage 1
print("\n[Stage 2] Monotonic prune (|Spearman ρ| ≥", rho_thresh, "and |Pearson r| <", linear_guard, ")")
print("Dropped:", len(drop2))
print(drop2) # list of column names dropped in Stage 2
[Stage 1] Linear prune (|Pearson r| ≥ 0.95 ) Dropped: 164 ['27', '36', '96', '104', '105', '106', '127', '140', '148', '152', '165', '174', '252', '271', '272', '274', '275', '277', '279', '280', '281', '282', '283', '285', '286', '287', '288', '289', '291', '294', '295', '296', '297', '298', '299', '300', '301', '302', '303', '304', '305', '306', '307', '308', '309', '310', '311', '312', '317', '318', '319', '320', '321', '323', '324', '332', '333', '334', '335', '336', '338', '339', '340', '341', '343', '344', '349', '350', '351', '352', '353', '354', '355', '357', '359', '360', '361', '362', '363', '365', '366', '376', '386', '388', '389', '390', '391', '392', '393', '405', '406', '407', '408', '409', '410', '411', '415', '416', '417', '420', '421', '424', '425', '426', '427', '428', '429', '435', '436', '437', '440', '441', '442', '443', '444', '445', '446', '447', '448', '452', '453', '454', '455', '457', '459', '467', '469', '470', '475', '477', '479', '490', '491', '493', '494', '495', '497', '520', '522', '523', '524', '525', '526', '527', '539', '540', '541', '545', '552', '553', '554', '556', '557', '561', '566', '567', '568', '574', '575', '576', '577', '584', '585', '588'] [Stage 2] Monotonic prune (|Spearman ρ| ≥ 0.95 and |Pearson r| < 0.85 ) Dropped: 5 ['18', '201', '337', '431', '496']
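Pairwise Pearson/Spearman pruning can miss multi-way collinearity, where a feature is a near-linear combination of several others. The variance inflation factor (VIF), obtainable as the diagonal of the inverse correlation matrix, catches this; a sketch on synthetic data (the cutoff of 10 is a common rule of thumb, not a property of SECOM):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
df_demo = pd.DataFrame({
    'a': a,
    'b': b,
    'c': a + b + rng.normal(scale=0.05, size=n),  # near-linear combo of a and b
})

# VIF_i = i-th diagonal entry of the inverse correlation matrix
corr = df_demo.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=df_demo.columns)
print(vif.round(1))

high_vif = vif[vif > 10].index.tolist()
print("High-VIF columns:", high_vif)  # columns entangled in the near-duplicate are flagged
```

Note that every column involved in the linear combination gets a high VIF, so in practice the highest-VIF feature is dropped one at a time and the VIFs recomputed.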
Heatmap for correlation matrix¶
# ---- 1. Select the specific variables by name ----
var_names = ["36", "34", "174", "172", "275", "4", "140", "585", "448"]
df_sub = df[var_names].copy()
df_sub_scaled = df_sub  # no scaling needed: Pearson correlation is scale-invariant
# ---- 2. Pearson correlation matrix ----
corr_matrix = df_sub_scaled.corr(method="pearson")
# ---- 3. Heatmap ----
plt.figure(figsize=(8, 6))
sns.heatmap(
corr_matrix,
annot=True, # show numbers; set to False if too busy
fmt=".2f",
cmap="coolwarm",
vmin=-1, vmax=1,
square=True,
linewidths=0.5
)
plt.title("Pearson Correlation Heatmap – Selected Variables")
plt.tight_layout()
plt.show()
cols = ["431", "160", "201", "196", "496", "224", "18", "12", "337"]
# Keep only those that actually exist in df (in case any are missing)
cols_in_df = [c for c in cols if c in df.columns]
print("Columns used for Spearman heatmap:", cols_in_df)
df_sub = df[cols_in_df].copy()
# ---- Spearman correlation matrix ----
spearman_corr = df_sub.corr(method="spearman")
# ---- Heatmap using matplotlib ----
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(spearman_corr, vmin=-1, vmax=1)
# Add colorbar
cbar = fig.colorbar(im, ax=ax)
cbar.set_label("Spearman correlation", rotation=270, labelpad=15)
# Set tick labels
ax.set_xticks(np.arange(len(cols_in_df)))
ax.set_yticks(np.arange(len(cols_in_df)))
ax.set_xticklabels(cols_in_df, rotation=45, ha="right")
ax.set_yticklabels(cols_in_df)
ax.set_title("Spearman Correlation Heatmap – Selected Variables")
plt.tight_layout()
plt.show()
Columns used for Spearman heatmap: ['431', '160', '201', '196', '496', '224', '18', '12', '337']
Scatterplots of the most strongly correlated variable pairs from Stage 1 (linear prune)
import os
def plot_stage1_linear_pairs(df, R, drop1, keep1, r_thresh=0.95, top_k=12, save_dir=None):
"""
For each column in drop1, find the kept column in keep1 with the strongest |Pearson r|
(>= r_thresh) and plot scatter + fitted line. Plots up to top_k strongest pairs.
"""
pairs = []
for c in drop1:
# Prefer a KEPT partner so you visualize the redundancy to what remains
cand = [k for k in keep1 if k != c and abs(R.loc[c, k]) >= r_thresh]
if not cand:
# fallback: any column meeting threshold (rare)
cand = [k for k in df.columns if k != c and abs(R.loc[c, k]) >= r_thresh]
if cand:
k = max(cand, key=lambda col: abs(R.loc[c, col]))
pairs.append((c, k, float(R.loc[c, k])))
if not pairs:
print("No Stage-1 pairs found at this threshold.")
return
# strongest first
pairs.sort(key=lambda t: -abs(t[2]))
if save_dir:
os.makedirs(save_dir, exist_ok=True)
for x_col, y_col, r_val in pairs[:top_k]:
x = df[x_col].to_numpy()
y = df[y_col].to_numpy()
m = np.isfinite(x) & np.isfinite(y)
x, y = x[m], y[m]
# simple linear fit for the smoothing line
a, b = np.polyfit(x, y, 1)
x_line = np.linspace(x.min(), x.max(), 200)
y_line = a * x_line + b
plt.figure()
plt.scatter(x, y, s=8, alpha=0.6)
plt.plot(x_line, y_line, linewidth=2)
plt.xlabel(x_col)
plt.ylabel(y_col)
plt.title(f"{x_col} vs {y_col} | Pearson r = {r_val:.3f}")
plt.tight_layout()
if save_dir:
fname = f"{x_col}__vs__{y_col}__r_{abs(r_val):.3f}.png"
plt.savefig(os.path.join(save_dir, fname), dpi=150)
plt.show()
# Use it:
plot_stage1_linear_pairs(df, R, drop1, keep1, r_thresh=r_thresh, top_k=12, save_dir="stage1_pairs")
Scatterplots of variable pairs from Stage 2 (monotonic prune)
# === Plot strongest Stage-2 (monotonic but not strongly linear) pairs ===
from sklearn.isotonic import IsotonicRegression
pairs = []
for c in drop2:
# partners that satisfy your Stage-2 rule with column c
candidates = [k for k in df1.columns
if k != c and (S1.loc[c, k] >= rho_thresh) and (R1.loc[c, k] < linear_guard)]
if not candidates:
continue
# keep the partner with the largest |Spearman|
best = max(candidates, key=lambda k: S1.loc[c, k])
# signed Spearman for direction; Pearson shown as absolute (matches your rule)
s_signed = df1[c].corr(df1[best], method="spearman")
r_abs = R1.loc[c, best]
pairs.append((c, best, float(s_signed), float(r_abs)))
# strongest first; plot up to N figures
pairs.sort(key=lambda t: -abs(t[2]))
N = min(12, len(pairs)) # change to see more/less plots
if N == 0:
print("No Stage-2 monotonic pairs found at current thresholds.")
else:
for x_col, y_col, s_val, r_abs in pairs[:N]:
x = df1[x_col].to_numpy()
y = df1[y_col].to_numpy()
m = np.isfinite(x) & np.isfinite(y)
x, y = x[m], y[m]
# sort by x so the line draws cleanly
order = np.argsort(x)
xs, ys = x[order], y[order]
# monotonic smoother (handles increasing or decreasing)
ir = IsotonicRegression(increasing=True, out_of_bounds="clip")
if s_val >= 0:
y_fit = ir.fit_transform(xs, ys)
else:
y_fit = -ir.fit_transform(xs, -ys)
plt.figure()
plt.scatter(xs, ys, s=8, alpha=0.6)
plt.plot(xs, y_fit, linewidth=2)
plt.xlabel(x_col)
plt.ylabel(y_col)
plt.title(f"{x_col} vs {y_col} | Spearman={s_val:.2f}, |Pearson|={r_abs:.2f}")
plt.tight_layout()
plt.show()
Comment:
The off-diagonal panels show strong, monotonic but non-linear relations (mostly inverse “L”/hyperbolic shapes), which is exactly why these pairs passed the Stage-2 rule (high Spearman, low Pearson).
Distributions are highly skewed with floor/saturation near zero and heteroscedasticity (fan shapes). A few outliers exist but don’t drive the overall trend.
Substantively, these five sensors likely track the same latent process/step; they’re largely redundant for modeling.
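The sign trick used in the plotting loop above (fitting an increasing isotonic model to `-y` to obtain a decreasing curve) can be verified in isolation. This is a standalone check on synthetic data, not part of the notebook's pipeline:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
xs = np.sort(rng.uniform(1, 20, 300))
ys = 1 / xs + rng.normal(0, 0.01, 300)   # noisy, monotonically decreasing relation

# An increasing-only isotonic fit can model a decreasing trend by
# negating y, fitting, and negating the fitted values back:
ir = IsotonicRegression(increasing=True, out_of_bounds="clip")
y_fit = -ir.fit_transform(xs, -ys)

# scikit-learn also supports the decreasing case directly; the two fits agree:
ir_dec = IsotonicRegression(increasing=False, out_of_bounds="clip")
y_fit2 = ir_dec.fit_transform(xs, ys)
print(np.allclose(y_fit, y_fit2))
```

Since scikit-learn exposes `increasing=False`, the negation trick is optional; the check simply confirms the loop above produces a valid non-increasing smoother.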
SECOM labels from {-1, 1} → {0, 1} with 1 = Fail and 0 = Pass.¶
# Target column
y_col = "Pass/Fail"
# 1) Pull y aligned to df_pruned's rows
y_raw = secom_reduced.loc[df_pruned.index, y_col]
# (optional) sanity check so you don't silently mislabel
vals = pd.unique(y_raw.dropna())
assert set(vals).issubset({-1, 1}), f"Unexpected labels in {y_col}: {vals}"
# 2) Remap {-1, 1} -> {0, 1} (0=Pass, 1=Fail)
# You can use either replace(...) or a boolean expression; both shown:
y = y_raw.replace({-1: 0, 1: 1}).astype("int8")
# y = (y_raw == 1).astype("int8") # equivalent
print(y)
X = df_pruned
print(X)
1 0
2 1
3 0
4 0
5 0
..
1537 0
1539 0
1540 0
1541 0
1550 0
Name: Pass/Fail, Length: 1393, dtype: int8
0 1 2 3 4 6 7 \
1 3095.78 2465.14 2230.4222 1463.6606 0.8294 102.3433 0.1247
2 2932.61 2559.94 2186.4111 1698.0172 1.5102 95.4878 0.1241
3 2988.72 2479.90 2199.0333 909.7926 1.3204 104.2367 0.1217
4 3032.24 2502.87 2233.3667 1326.5200 1.5334 100.3967 0.1235
5 2946.25 2432.84 2233.3667 1326.5200 1.5334 100.3967 0.1235
... ... ... ... ... ... ... ...
1537 3006.22 2525.20 2192.7889 1268.5852 1.9935 104.5867 0.1268
1539 2908.94 2560.99 2187.3444 2882.8558 1.5876 85.4189 0.1235
1540 2996.04 2555.92 2190.7666 3530.2362 0.8017 83.8767 0.1249
1541 3246.31 2499.79 2216.8111 1190.4067 2.5148 114.5533 0.1230
1550 3072.20 2406.47 2195.4444 2914.1792 1.5978 85.1011 0.1235
8 9 10 ... 569 570 571 572 573 \
1 1.4966 -0.0005 -0.0148 ... 17.1574 535.0164 2.4335 5.92 0.2653
2 1.4436 0.0041 0.0013 ... 68.8489 535.0245 2.0293 11.21 0.1882
3 1.4882 -0.0124 -0.0033 ... 25.0363 530.5682 2.0253 9.33 0.1738
4 1.5031 -0.0031 -0.0072 ... 17.1574 532.0155 2.0275 8.83 0.2224
5 1.5287 0.0167 0.0055 ... 22.5598 534.2091 2.3236 8.91 0.3201
... ... ... ... ... ... ... ... ... ...
1537 1.4522 -0.0039 -0.0075 ... 19.6342 533.7318 2.0706 8.13 0.5545
1539 1.4167 0.0041 -0.0056 ... 24.7258 569.9964 1.7961 441.92 1.2113
1540 1.4158 -0.0029 0.0000 ... 15.2498 525.6127 2.0059 10.25 0.5070
1541 1.3966 -0.0057 -0.0169 ... 15.4662 526.2355 2.1114 8.15 0.2497
1550 1.3148 -0.0024 0.0039 ... 15.4662 532.1618 2.1401 7.11 0.4659
582 583 586 587 589
1 0.5019 0.0223 0.0096 0.0201 208.2045
2 0.4958 0.0157 0.0584 0.0484 82.8602
3 0.4990 0.0103 0.0202 0.0149 73.8432
4 0.4800 0.4766 0.0202 0.0149 73.8432
5 0.4949 0.0189 0.0342 0.0151 44.0077
... ... ... ... ... ...
1537 0.4942 0.0175 0.0199 0.0097 48.7045
1539 0.4987 0.0118 0.0237 0.0112 47.3376
1540 0.5011 0.0163 0.0181 0.0123 67.6676
1541 0.5021 0.0103 0.0266 0.0063 23.5979
1550 0.5034 0.0178 0.0236 0.0065 27.5514
[1393 rows x 263 columns]
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
confusion_matrix, f1_score, average_precision_score,
precision_recall_curve, ConfusionMatrixDisplay
)
from imblearn.over_sampling import SMOTE
# --- ASSUMED DATA PREPARATION ---
# X_df is the cleaned feature set
X_df = df_pruned.copy()
# y_arr is the binary target array (Pass=-1 -> 0, Fail=1 -> 1)
y_arr = (secom_upd['Pass/Fail'] == 1).astype(int).values
RANDOM_STATE = 42
THRESHOLD = 0.5
STAGE_NAME = "2nd Stage: 263 Features"
print(f"======== {STAGE_NAME} ({X_df.shape[1]} features) ========")
# 1. Split Data (Stratified)
X_train, X_test, y_train, y_test = train_test_split(
X_df, y_arr, test_size=0.3, random_state=RANDOM_STATE, stratify=y_arr
)
# 2. Preprocessing Pipeline: Impute (Median) -> Scale
preprocessor = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
X_train_scaled = preprocessor.fit_transform(X_train)
X_test_scaled = preprocessor.transform(X_test)
# 3. SMOTE on Training Data
smote = SMOTE(sampling_strategy='minority', random_state=RANDOM_STATE)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)
print(f"Train size: {len(y_train)} -> Resampled size: {len(y_train_resampled)}")
# 4. Logistic Regression Model
model = LogisticRegression(solver='liblinear', penalty='l2', random_state=RANDOM_STATE)
model.fit(X_train_resampled, y_train_resampled)
# 5. Prediction and Evaluation (on Test Set)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
y_pred = (y_pred_proba >= THRESHOLD).astype(int)
# Metrics
f1 = f1_score(y_test, y_pred)
auprc = average_precision_score(y_test, y_pred_proba)
cm = confusion_matrix(y_test, y_pred)
print("\n--- Performance Metrics (Test Set) ---")
print(f"F1-score: {f1:.4f}")
print(f"AUPRC: {auprc:.4f}")
# 6. Plotting: Confusion Matrix & PR Curve
plt.figure(figsize=(12, 5))
# Confusion Matrix
plt.subplot(1, 2, 1)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Pass (0)', 'Fail (1)'])
disp.plot(ax=plt.gca(), cmap='Blues', values_format='d')
plt.title(f"Confusion Matrix\n({STAGE_NAME})")
plt.grid(False)
# Precision-Recall Curve
p, r, _ = precision_recall_curve(y_test, y_pred_proba)
baseline = np.sum(y_test) / len(y_test)
plt.subplot(1, 2, 2)
plt.plot(r, p, label=f'Test Set (AUPRC={auprc:.4f})')
plt.plot([0, 1], [baseline, baseline], linestyle='--', color='red', label=f'Baseline (AP={baseline:.4f})')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve (Test Set)")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
======== 2nd Stage: 263 Features (263 features) ======== Train size: 975 -> Resampled size: 1812 --- Performance Metrics (Test Set) --- F1-score: 0.2222 AUPRC: 0.1537
Recursive Feature Elimination¶
I performed supervised feature selection using Recursive Feature Elimination with cross‑validation (RFECV) on a cleaned feature set (df_pruned) that already had:
missing values handled,
near‑constant (low‑variance) variables removed, and
strongly collinear variables pruned.
RFE then learned which of the remaining sensors are most predictive of failures, and selected a compact subset while nested cross‑validation gave unbiased performance estimates and a stability read‑out (how often each feature is chosen across folds).
After handling missing values, removing near‑constant variables, and pruning highly correlated sensors, we applied recursive feature elimination with cross‑validation (RFECV) using a class‑weighted logistic regression base estimator. All preprocessing (median imputation, z‑scaling) and feature selection were encapsulated in a Pipeline and evaluated under a nested 5‑fold stratified cross‑validation scheme. In the inner loop, RFECV iteratively removed the least important features (10% per step), and selected the number of features that maximized average precision; the outer loop provided unbiased estimates of generalization performance. We additionally computed the selection frequency of each feature across outer folds to quantify stability and defined a ‘stable’ subset as features selected in at least 60% of folds.
# ============================
# Recursive Feature Elimination (RFECV) on df_pruned
# Place this AFTER you define `df_pruned` and `y`
# ============================
RANDOM_STATE = 42
# --- 0) Inputs from your pipeline ---
X = df_pruned.copy() # features *after* your low-variance + correlation pruning
feature_names = X.columns
assert len(X) == len(y), "X and y must align"
# --- 1) Base estimator for RFE (fast, linear, class-weighted for imbalance) ---
base_est = LogisticRegression(
penalty="l2",
solver="liblinear",
class_weight="balanced",
max_iter=5000,
random_state=RANDOM_STATE
)
# --- 2) Inner CV (chooses how many features RFE keeps) + outer CV (evaluates the entire pipeline on unseen data → unbiased metrics.) ---
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE + 1)
# Reasonable lower bound: keep at least 5% of features or 20, whichever is larger
min_keep = max(20, int(0.05 * X.shape[1]))
rfecv = RFECV(
estimator=base_est, # Train the estimator, rank features by importance
step=0.1, # remove ~10% of remaining features per step; use an int (e.g., 50) to cap compute
min_features_to_select=min_keep,
cv=inner_cv, # Re‑fit and re‑score via inner CV
scoring="average_precision", # AUPRC is robust for your class imbalance
n_jobs=-1
)
pipe = Pipeline(steps=[
# Everything happens inside the pipeline so the imputer, scaler, and feature selection are fit only on training folds and then applied to validation/test folds. This prevents leakage.
("impute", SimpleImputer(strategy="median")),
("scale", StandardScaler()),
("rfe", rfecv),
# After RFE chooses a subset, a fresh logistic model is trained on that subset in each fold.
("clf", LogisticRegression(
penalty="l2",
solver="liblinear",
class_weight="balanced",
max_iter=5000,
random_state=RANDOM_STATE
))
])
# --- 3) Evaluate with nested CV + return the fitted estimators to study selections ---
scoring = {
"ap": "average_precision",
"roc": "roc_auc",
"f1": "f1",
"mcc": make_scorer(matthews_corrcoef)
}
# Reports AUPRC, ROC‑AUC, F1, MCC across the 5 outer folds.
cv_res = cross_validate( pipe, X, y, cv=outer_cv,
scoring=scoring,
return_estimator=True, n_jobs=-1
)
print(
f"AUPRC (mean±sd): {cv_res['test_ap'].mean():.3f} ± {cv_res['test_ap'].std():.3f}\n"
f"ROC-AUC (mean±sd): {cv_res['test_roc'].mean():.3f} ± {cv_res['test_roc'].std():.3f}\n"
f"F1 (mean±sd): {cv_res['test_f1'].mean():.3f} ± {cv_res['test_f1'].std():.3f}\n"
f"MCC (mean±sd): {cv_res['test_mcc'].mean():.3f} ± {cv_res['test_mcc'].std():.3f}"
)
# --- 4) Inspect which features are repeatedly selected across outer folds ---
# For each feature, compute the fraction of outer folds in which it was selected → selection frequency in [0,1].
support_matrix = np.vstack([est.named_steps["rfe"].support_ for est in cv_res["estimator"]])
freq = support_matrix.mean(axis=0) # selection frequency in [0,1]
selection_frequency = (
pd.Series(freq, index=feature_names, name="selection_freq")
.sort_values(ascending=False)
)
# Thresholding heuristic for a stable final set:
# keep features selected in at least 60% of outer folds (tune if needed)
stable_mask = selection_frequency >= 0.60
selected_cols_rfe = selection_frequency.index[stable_mask].tolist()
print(f"\nRFECV kept (median across folds): "
f"{int(np.median([est.named_steps['rfe'].n_features_ for est in cv_res['estimator']]))} features")
print(f"Stable features (≥60% of folds): {len(selected_cols_rfe)}")
print(selection_frequency.head(20)) # top-20 by stability
# selection_frequency.to_csv("rfe_feature_stability.csv", index=True)
# --- 5) (Optional) Refit a final model on ALL data using the stable subset ---
# You can swap LogisticRegression with XGBClassifier/RandomForest if preferred.
final_model = LogisticRegression(
penalty="l2", solver="liblinear", class_weight="balanced",
max_iter=5000, random_state=RANDOM_STATE
)
final_model.fit(X[selected_cols_rfe], y)
# If you want to carry the selected set forward in your notebook:
# selected_cols = selected_cols_rfe
AUPRC (mean±sd): 0.161 ± 0.042 ROC-AUC (mean±sd): 0.670 ± 0.088 F1 (mean±sd): 0.204 ± 0.070 MCC (mean±sd): 0.137 ± 0.088 RFECV kept (median across folds): 55 features Stable features (≥60% of folds): 64 59 1.0 56 1.0 64 1.0 151 1.0 155 1.0 204 1.0 65 0.8 25 0.8 121 0.8 117 0.8 130 0.8 124 0.8 45 0.8 67 0.8 129 0.8 61 0.8 471 0.8 197 0.8 172 0.8 205 0.8 Name: selection_freq, dtype: float64
LogisticRegression(class_weight='balanced', max_iter=5000, random_state=42, solver='liblinear')
Visualize Feature-selection stability¶
# ---- 1A. "Scree"-like line: selection frequency sorted descending ----
sf = selection_frequency.sort_values(ascending=False)
k_stable = int((sf >= 0.60).sum())
plt.figure(figsize=(8, 4))
plt.plot(range(1, len(sf) + 1), sf.values, marker='o', linewidth=1)
plt.axhline(0.60, linestyle='--')
plt.axvline(k_stable, linestyle='--')
plt.title("RFE selection stability (outer CV)")
plt.xlabel("Feature rank by stability")
plt.ylabel("Selection frequency")
plt.tight_layout()
plt.show()
# ---- 1B. Top‑k bar chart ----
top_k = min(20, len(sf))
top = sf.head(top_k)[::-1] # reverse for horizontal barh
plt.figure(figsize=(8, max(4, top_k * 0.28)))
plt.barh(top.index.astype(str), top.values)
plt.xlabel("Selection frequency")
plt.title(f"Top {top_k} features by RFE stability")
plt.tight_layout()
plt.show()
Comment:
# cv_res contains test scores for each outer fold
metrics_df = pd.DataFrame({
"AUPRC": cv_res["test_ap"],
"ROC-AUC": cv_res["test_roc"],
"F1": cv_res["test_f1"],
"MCC": cv_res["test_mcc"],
})
plt.figure(figsize=(7, 4))
plt.boxplot([metrics_df[c].values for c in metrics_df.columns], labels=list(metrics_df.columns))
plt.ylabel("Score")
plt.title("Outer-CV performance with RFE pipeline")
plt.tight_layout()
plt.show()
What is the next step?¶
- RFECV favored moderately sized subsets (median 55 features across outer folds), and selection‑frequency analysis revealed a stable core (6 features selected in every fold, with a further tier at 80%). To avoid reinserting correlated surrogates and to reduce variance, we fixed the final subset at the top 20 by stability. We also evaluated the 64‑feature union (≥60% frequency) as a robustness check; results were comparable, so we report 20 features for parsimony and interpretability.
- I included PCA (retain 90–95% variance) as an unsupervised baseline; manifold methods (t‑SNE/MDS) were used solely for visualization due to their lack of stable, out‑of‑sample transforms.
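The PCA baseline mentioned above can be sketched as a leakage-free pipeline. This is a minimal illustration on synthetic imbalanced data; the 95% variance target mirrors the text, and `X_demo`/`y_demo` are stand-ins for the notebook's `X` and `y`:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data with SECOM-like class imbalance (~7% positives).
X_demo, y_demo = make_classification(n_samples=600, n_features=50,
                                     weights=[0.93], random_state=42)

# A float n_components in (0, 1) keeps the smallest number of components
# whose cumulative explained variance reaches that fraction (here 95%).
pca_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95, random_state=42)),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=5000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
ap = cross_val_score(pca_pipe, X_demo, y_demo, cv=cv, scoring="average_precision")
print(f"PCA baseline AUPRC: {ap.mean():.3f} ± {ap.std():.3f}")
```

Because PCA sits inside the Pipeline, its components are re-fit on each training fold, so the baseline is evaluated under the same leakage-free protocol as the RFE pipeline.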
# ============================================================
# Select TOP-K features by outer-CV AUPRC, then finalize model
# ============================================================
# --- 0) Ensure CV / scoring objects ---
if "outer_cv" not in globals():
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = {
"ap": "average_precision",
"roc": "roc_auc",
"f1": "f1",
"mcc": make_scorer(matthews_corrcoef),
}
# --- 1) Choose the candidate pool to rank (pick one) ---
assert "selection_frequency" in globals() and "selected_cols_rfe" in globals(), "Run the RFE block first."
CANDIDATE_POOL = "rfe_stable" # options: "rfe_stable", "rfe20", "all"
if CANDIDATE_POOL == "rfe_stable":
candidates = list(selected_cols_rfe) # ≥60% stable set (64 features in this run)
elif CANDIDATE_POOL == "rfe20":
candidates = selection_frequency.sort_values(ascending=False).head(20).index.tolist()
elif CANDIDATE_POOL == "all":
candidates = list(X.columns)
else:
raise ValueError("CANDIDATE_POOL must be one of {'rfe_stable','rfe20','all'}")
# --- 2) Set k (how many features to keep) ---
k = 20 # <-- change this to the number of features you want
k = min(k, len(candidates))
# --- 3) Define the univariate evaluation pipeline (Impute->Scale->Logistic) ---
uni_pipe = Pipeline([
("impute", SimpleImputer(strategy="median")),
("scale", StandardScaler()),
("clf", LogisticRegression(
penalty="l2", solver="liblinear", class_weight="balanced",
max_iter=5000, random_state=RANDOM_STATE
))
])
# --- 4) Score each feature individually by AUPRC using the same outer CV ---
rows = []
for f in candidates:
Xi = X[[f]].copy()
ap_scores = cross_val_score(uni_pipe, Xi, y, cv=outer_cv, scoring="average_precision", n_jobs=-1)
rows.append({"feature": f, "ap_mean": ap_scores.mean(), "ap_sd": ap_scores.std()})
rank_df = pd.DataFrame(rows).sort_values(["ap_mean", "ap_sd"], ascending=[False, True]).reset_index(drop=True)
topk_features = rank_df.head(k)["feature"].tolist()
print(f"\nTop-{k} features by outer-CV AUPRC from pool '{CANDIDATE_POOL}':")
print(rank_df.head(k))
# --- 5) Build the final k-feature model (no PCA), evaluate, and pick threshold by OOF-F1 ---
final_pipe_k = Pipeline([
("impute", SimpleImputer(strategy="median")),
("scale", StandardScaler()),
("clf", LogisticRegression(
penalty="l2", solver="liblinear", class_weight="balanced",
max_iter=5000, random_state=RANDOM_STATE
))
])
X_k = X[topk_features].copy()
# 5a) Outer-CV performance snapshot (optional, for reporting)
cv_perf = cross_validate(final_pipe_k, X_k, y, cv=outer_cv, scoring=scoring, n_jobs=-1)
print(
f"\n[Top-{k} model | outer CV]\n"
f"AUPRC (mean±sd): {cv_perf['test_ap'].mean():.3f} ± {cv_perf['test_ap'].std():.3f}\n"
f"ROC-AUC (mean±sd): {cv_perf['test_roc'].mean():.3f} ± {cv_perf['test_roc'].std():.3f}\n"
f"F1 (mean±sd): {cv_perf['test_f1'].mean():.3f} ± {cv_perf['test_f1'].std():.3f}\n"
f"MCC (mean±sd): {cv_perf['test_mcc'].mean():.3f} ± {cv_perf['test_mcc'].std():.3f}"
)
# 5b) Pick a deployable operating threshold by maximizing F1 on OOF probabilities (no leakage)
oof_prob = cross_val_predict(final_pipe_k, X_k, y, cv=outer_cv, method="predict_proba", n_jobs=-1)[:, 1]
p, r, thr = precision_recall_curve(y, oof_prob)
# precision[i]/recall[i] pair with thresholds[i]; the appended (1, 0) endpoint has no threshold
f1_scores = 2 * p[:-1] * r[:-1] / (p[:-1] + r[:-1] + 1e-12)
thr_star = float(thr[np.argmax(f1_scores)])
print(f"Chosen operating threshold (max F1 on OOF): {thr_star:.3f}")
# --- 6) Fit the final model on ALL data (ready for deployment) ---
final_pipe_k.fit(X_k, y)
# --- 7) Prediction helper for new data ---
def predict_with_threshold_topk(pipe, X_new_df, features=topk_features, threshold=thr_star):
Xn = X_new_df[features].copy()
prob = pipe.predict_proba(Xn)[:, 1]
yhat = (prob >= threshold).astype("int8")
return yhat, prob
Top-20 features by outer-CV AUPRC from pool 'rfe_stable': feature ap_mean ap_sd 0 59 0.180647 0.037314 1 64 0.171111 0.108788 2 21 0.156464 0.045563 3 348 0.155786 0.066097 4 205 0.149234 0.015680 5 65 0.147031 0.078618 6 124 0.138425 0.068338 7 121 0.138024 0.066722 8 163 0.136071 0.049017 9 25 0.130879 0.034320 10 197 0.129044 0.038366 11 123 0.128341 0.059613 12 100 0.125632 0.055898 13 99 0.123956 0.042941 14 573 0.121167 0.033835 15 164 0.116457 0.032798 16 26 0.115988 0.037675 17 129 0.113052 0.018461 18 138 0.111983 0.031427 19 565 0.106037 0.041527 [Top-20 model | outer CV] AUPRC (mean±sd): 0.246 ± 0.048 ROC-AUC (mean±sd): 0.778 ± 0.029 F1 (mean±sd): 0.261 ± 0.023 MCC (mean±sd): 0.225 ± 0.033 Chosen operating threshold (max F1 on OOF): 0.762
3. Feature Extraction¶
SMOTE Logistic Regression¶
# --- ASSUMED DATA PREPARATION ---
# X_df is the RFE-selected top 20 features
X_df = X_k.copy()
# y_arr is the binary target array (Pass=-1 -> 0, Fail=1 -> 1)
y_arr = (secom_upd['Pass/Fail'] == 1).astype(int).values
RANDOM_STATE = 42
THRESHOLD = 0.5
STAGE_NAME = "3rd Stage: Top 20 RFE Features"
print(f"======== {STAGE_NAME} ({X_df.shape[1]} features) ========")
# 1. Split Data (Stratified)
X_train, X_test, y_train, y_test = train_test_split(
X_df, y_arr, test_size=0.3, random_state=RANDOM_STATE, stratify=y_arr
)
# 2. Preprocessing Pipeline: Impute (Median) -> Scale
preprocessor = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
X_train_scaled = preprocessor.fit_transform(X_train)
X_test_scaled = preprocessor.transform(X_test)
# 3. SMOTE on Training Data
smote = SMOTE(sampling_strategy='minority', random_state=RANDOM_STATE)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)
print(f"Train size: {len(y_train)} -> Resampled size: {len(y_train_resampled)}")
# 4. Logistic Regression Model
model = LogisticRegression(solver='liblinear', penalty='l2', random_state=RANDOM_STATE)
model.fit(X_train_resampled, y_train_resampled)
# 5. Prediction and Evaluation (on Test Set)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
y_pred = (y_pred_proba >= THRESHOLD).astype(int)
# Metrics
f1 = f1_score(y_test, y_pred)
auprc = average_precision_score(y_test, y_pred_proba)
cm = confusion_matrix(y_test, y_pred)
print("\n--- Performance Metrics (Test Set) ---")
print(f"F1-score: {f1:.4f}")
print(f"AUPRC: {auprc:.4f}")
# 6. Plotting: Confusion Matrix & PR Curve
plt.figure(figsize=(12, 5))
# Confusion Matrix
plt.subplot(1, 2, 1)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Pass (0)', 'Fail (1)'])
disp.plot(ax=plt.gca(), cmap='Blues', values_format='d')
plt.title(f"Confusion Matrix\n({STAGE_NAME})")
plt.grid(False)
# Precision-Recall Curve
p, r, _ = precision_recall_curve(y_test, y_pred_proba)
baseline = np.sum(y_test) / len(y_test)
plt.subplot(1, 2, 2)
plt.plot(r, p, label=f'Test Set (AUPRC={auprc:.4f})')
plt.plot([0, 1], [baseline, baseline], linestyle='--', color='red', label=f'Baseline (AP={baseline:.4f})')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve (Test Set)")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
======== 3rd Stage: Top 20 RFE Features (20 features) ======== Train size: 975 -> Resampled size: 1812 --- Performance Metrics (Test Set) --- F1-score: 0.2326 AUPRC: 0.1628
Apply cluster method for top 20 features¶
# Raw (unscaled) feature values: large-magnitude sensors dominate the Euclidean distances here
Xk_reset = X_k.reset_index(drop=True).to_numpy()
Xk_reset
array([[ 8.07300e-01, 1.91927e+01, -5.44150e+03, ..., -9.46000e-02,
5.62000e+01, 1.19900e-01],
[ 2.38245e+01, 1.61755e+01, -5.44775e+03, ..., -1.89200e-01,
4.51001e+01, 6.21900e-01],
[ 2.43791e+01, 1.56209e+01, -5.46825e+03, ..., 2.83800e-01,
4.78000e+01, 1.63000e-01],
...,
[ 5.63820e+00, 1.43618e+01, -5.68550e+03, ..., -9.93400e-01,
2.79001e+01, 1.18500e-01],
[-3.18180e+00, 2.31818e+01, -5.28100e+03, ..., -9.93400e-01,
5.25999e+01, 8.77000e-02],
[ 3.06820e+00, 1.69318e+01, -5.72700e+03, ..., -5.67700e-01,
2.20000e+01, 8.77000e-02]])
KMEANS METHOD¶
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
kmeans.fit(Xk_reset)
wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
kmeans_k = KMeans(n_clusters = 2, init = 'k-means++', random_state = 42)
yk_kmeans = kmeans_k.fit_predict(Xk_reset)
pd.Series(yk_kmeans).value_counts()
| cluster | count |
|---|---|
| 0 | 1035 |
| 1 | 358 |
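Alongside the elbow heuristic, the silhouette score gives a second opinion on the number of clusters. A self-contained sketch on synthetic blobs (with the notebook's data, you would pass `Xk_reset` instead of `X_blobs`):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters as a stand-in dataset.
X_blobs, _ = make_blobs(n_samples=500, centers=[[0, 0], [8, 8], [-8, 8]],
                        cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    random_state=42).fit_predict(X_blobs)
    scores[k] = silhouette_score(X_blobs, labels)  # mean silhouette in [-1, 1]

best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)  # 3 for three well-separated blobs
```

Unlike WCSS, the silhouette score has an interior maximum, so no visual "elbow" judgment is needed; agreement between the two heuristics strengthens the choice of k.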
plt.scatter(Xk_reset[yk_kmeans == 0, 0], Xk_reset[yk_kmeans == 0, 1], s = 15, c = 'lightskyblue', label = 'Cluster 1')
plt.scatter(Xk_reset[yk_kmeans == 1, 0], Xk_reset[yk_kmeans == 1, 1], s = 15, c = 'orange', label = 'Cluster 2')
plt.scatter(kmeans_k.cluster_centers_[:, 0], kmeans_k.cluster_centers_[:, 1], s = 20, c = 'red', label = 'Centroids')
plt.title('K-Means clusters over the first two selected features')
plt.xlabel('Feature 1 (first selected sensor)')
plt.ylabel('Feature 2 (second selected sensor)')
plt.legend()
plt.show()
scaler = StandardScaler()
Xk_scaled = scaler.fit_transform(X_k)
Xk_scaled_df = pd.DataFrame(Xk_scaled, columns=X_k.columns)
# --- Step 1: Apply 2D t-SNE ---
# n_components=2 for 2D visualization
tsne_2d = TSNE(n_components=2, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Xk_tsne_2d = tsne_2d.fit_transform(Xk_scaled_df)
# Convert to DataFrame for easier plotting
Xk_tsne_df = pd.DataFrame(data = Xk_tsne_2d, columns = ['tSNE_1', 'tSNE_2'])
Xk_tsne_df['KMeans_Cluster'] = yk_kmeans # Add cluster labels
# --- Step 2: Plot the 2D t-SNE results ---
plt.figure(figsize=(8, 6))
for cluster in sorted(Xk_tsne_df['KMeans_Cluster'].unique()):
subset = Xk_tsne_df[Xk_tsne_df['KMeans_Cluster'] == cluster]
plt.scatter(
subset['tSNE_1'],
subset['tSNE_2'],
s = 10,
label = f'Cluster {cluster}'
)
plt.title('K-Means Clusters Visualized by 2D t-SNE')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.legend()
plt.show()
# Assuming X_scaled_df and y_kmeans are ready
# --- Step 1: Apply 3D t-SNE ---
tsne_3d = TSNE(n_components=3, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Xk_tsne_3d = tsne_3d.fit_transform(Xk_scaled_df)
# Convert to DataFrame
Xk_tsne_df_3d = pd.DataFrame(data = Xk_tsne_3d, columns = ['tSNE_1', 'tSNE_2', 'tSNE_3'])
Xk_tsne_df_3d['KMeans_Cluster'] = yk_kmeans
# Define the color for the clusters
cluster_colors = {
0: 'orange',
1: 'lightskyblue',
}
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
for cluster_label, color_name in cluster_colors.items():
subset = Xk_tsne_df_3d[Xk_tsne_df_3d['KMeans_Cluster'] == cluster_label]
ax.scatter(
subset['tSNE_1'],
subset['tSNE_2'],
subset['tSNE_3'],
s=10,
c=color_name, # <<< Use the specific color here
label=f'Cluster {cluster_label}',
alpha=0.7
)
ax.set_title('K-Means Clusters Visualized by 3D t-SNE')
ax.set_xlabel('t-SNE Component 1')
ax.set_ylabel('t-SNE Component 2')
ax.set_zlabel('t-SNE Component 3')
ax.legend()
plt.show()
# Apply 3D t-SNE
tsne_3d = TSNE(n_components=3, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Xk_tsne_3d = tsne_3d.fit_transform(Xk_scaled_df)
# Convert to DataFrame
Xk_tsne_df_3d = pd.DataFrame(data = Xk_tsne_3d, columns = ['tSNE_1', 'tSNE_2', 'tSNE_3'])
Xk_tsne_df_3d['KMeans_Cluster'] = yk_kmeans
# Define the color for the clusters
cluster_colors = {
0: 'lightskyblue',
1: 'orange',
}
# Ensure the cluster column is treated as a string/categorical for distinct coloring
Xk_tsne_df_3d['KMeans_Cluster_Str'] = Xk_tsne_df_3d['KMeans_Cluster'].astype(str)
# Define the color of the Kmeans cluster
specific_plotly_colors = {
'0': 'lightskyblue',
'1': 'orange',
}
# Create the interactive 3D scatter plot using Plotly Express
fig = px.scatter_3d(
Xk_tsne_df_3d,
x='tSNE_1',
y='tSNE_2',
z='tSNE_3',
color='KMeans_Cluster_Str', # Color points by cluster
color_discrete_map=specific_plotly_colors, # <<< Assign specific colors
title='Interactive 3D K-Means Clusters (Plotly)',
opacity=0.7, # Transparency level
size_max=5, # only applies when a `size` column is mapped; update_traces below sets the actual marker size
labels={'tSNE_1': 't-SNE Component 1',
'tSNE_2': 't-SNE Component 2',
'tSNE_3': 't-SNE Component 3'}
)
# You can further customize the marker size and styling if needed:
fig.update_traces(marker=dict(size=3)) # <<< Sets a fixed size for all markers
fig.show() # This displays the rotatable plot in Google Colab
DBSCAN and t-SNE¶
# Import necessary libraries
from sklearn.cluster import DBSCAN
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
# 1. --- FIT DBSCAN MODEL ---
min_samples_start_k = int(np.log(Xk_scaled_df.shape[0]))
eps_start = 3 # Start with a large value and tune down, or use a k-distance plot.
dbscan = DBSCAN(eps=eps_start, min_samples=min_samples_start_k)
secom_clusters = dbscan.fit_predict(Xk_scaled_df)
print(f"DBSCAN Cluster Labels: {np.unique(secom_clusters)}")
print(f"Cluster Distribution (Tuning Required):\n{pd.Series(secom_clusters).value_counts()}")
# 2. --- PREPARE FOR VISUALIZATION (using t-SNE) ---
# DBSCAN was fit on the high-dimensional data, but we need 2D data for a plot.
# We will use t-SNE for dimensionality reduction for visualization.
tsne_2d = TSNE(n_components=2, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Xk_tsne_2d = tsne_2d.fit_transform(Xk_scaled_df)
# Create a DataFrame for plotting
df_secom = pd.DataFrame(Xk_tsne_2d, columns=['tSNE_1', 'tSNE_2'])
df_secom['Cluster'] = secom_clusters
# 3. --- PLOT THE DBSCAN CLUSTERS (using t-SNE components) ---
plt.figure(figsize=(10, 8))
plt.scatter(
df_secom['tSNE_1'],
df_secom['tSNE_2'],
c=df_secom['Cluster'],
cmap='viridis', # 'viridis' often works well for many colors
s=10, # Minimized dot size
alpha=0.7
)
plt.title(f'DBSCAN Clusters on SECOM Data (Visualized with 2D t-SNE)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.colorbar(label='Cluster Label (-1 is Noise)')
plt.show()
DBSCAN Cluster Labels: [-1 0 1] Cluster Distribution (Tuning Required): 0 1287 -1 97 1 9 Name: count, dtype: int64
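The k-distance plot mentioned in the comment above (for tuning `eps`) sorts each point's distance to its k-th nearest neighbor and looks for the knee. A self-contained sketch on synthetic data; in the notebook you would substitute `Xk_scaled_df` for `X_demo` and reuse `min_samples_start_k` as `k`:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Stand-in for the scaled feature matrix.
X_demo, _ = make_blobs(n_samples=400, centers=2, cluster_std=1.0, random_state=42)

k = max(2, int(np.log(X_demo.shape[0])))  # same min_samples rule as the DBSCAN cell
nn = NearestNeighbors(n_neighbors=k).fit(X_demo)
dists, _ = nn.kneighbors(X_demo)  # note: each point counts itself as its first neighbor
k_dist = np.sort(dists[:, -1])    # distance to the k-th neighbor, sorted ascending

plt.plot(k_dist)
plt.xlabel("Points sorted by k-distance")
plt.ylabel(f"Distance to {k}-th nearest neighbor")
plt.title("k-distance plot: choose eps near the knee")
plt.show()
```

Points left of the knee lie in dense regions; the distance at the knee is a principled starting `eps`, replacing the arbitrary `eps_start = 3` with a data-driven value.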
from mpl_toolkits.mplot3d import Axes3D # 3D projection support (auto-registered in Matplotlib >= 3.2)
# --- 1. FIT DBSCAN MODEL (No change required here) ---
# Assuming Xk_scaled_df is your high-dimensional, scaled feature data
min_samples_start_k = int(np.log(Xk_scaled_df.shape[0]))
eps_start = 3
dbscan = DBSCAN(eps=eps_start, min_samples=min_samples_start_k)
secom_clusters = dbscan.fit_predict(Xk_scaled_df)
print(f"DBSCAN Cluster Labels: {np.unique(secom_clusters)}")
print(f"Cluster Distribution (Tuning Required):\n{pd.Series(secom_clusters).value_counts()}")
# 2. --- PREPARE FOR VISUALIZATION (using 3D t-SNE) ---
# Change n_components from 2 to 3
tsne_3d = TSNE(n_components=3, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Xk_tsne_3d = tsne_3d.fit_transform(Xk_scaled_df)
# Create a DataFrame for plotting, adding the third component
df_secom_3d = pd.DataFrame(Xk_tsne_3d, columns=['tSNE_1', 'tSNE_2', 'tSNE_3'])
df_secom_3d['Cluster'] = secom_clusters
# 3. --- PLOT THE DBSCAN CLUSTERS (using 3D t-SNE components) ---
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d') # <<< KEY CHANGE: Adds 3D axis projection
# Plotting the clusters
scatter = ax.scatter(
df_secom_3d['tSNE_1'],
df_secom_3d['tSNE_2'],
df_secom_3d['tSNE_3'], # <<< ADDING THE THIRD COMPONENT
c=df_secom_3d['Cluster'],
cmap='viridis',
s=10,
alpha=0.7
)
ax.set_title(f'DBSCAN Clusters on SECOM Data (Visualized with 3D t-SNE)')
ax.set_xlabel('t-SNE Component 1')
ax.set_ylabel('t-SNE Component 2')
ax.set_zlabel('t-SNE Component 3') # <<< ADDING Z-AXIS LABEL
# To display the color bar correctly for 3D Matplotlib
plt.colorbar(scatter, label='Cluster Label (-1 is Noise)')
plt.show()
DBSCAN Cluster Labels: [-1  0  1]
Cluster Distribution (Tuning Required):
 0    1287
-1      97
 1       9
Name: count, dtype: int64
4. DATA IMBALANCE by Autoencoders combined with Data Sampling¶
# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset, random_split
# Scikit-learn
from sklearn.metrics import (
confusion_matrix,
classification_report,
average_precision_score,
precision_recall_curve,
PrecisionRecallDisplay,
ConfusionMatrixDisplay
)
# Sampling methods
from imblearn.over_sampling import SMOTE, RandomOverSampler, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN, SMOTETomek
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# System and utilities
import warnings
warnings.filterwarnings('ignore')
Autoencoder¶
# =========================================================================
# 0. REPRODUCIBILITY: SET SEEDS
# =========================================================================
SEED = 42 # pick any integer you like
import random  # stdlib RNG; not imported in the notebook's import cell
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)
# (Optional but recommended for strict reproducibility)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
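The seeding calls above can be bundled into a single helper so every re-run of a cell starts from the same state. This is a sketch (the `seed_everything` name is our own, not from the notebook); the torch-specific calls are guarded so it also runs where PyTorch is not installed:

```python
import os
import random
import numpy as np

def seed_everything(seed: int = 42):
    """Seed the stdlib, NumPy, and (if present) PyTorch RNGs in one call."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # torch absent; NumPy/stdlib seeding still applies

seed_everything(42)
a = np.random.rand(3)
seed_everything(42)
b = np.random.rand(3)  # identical draw after reseeding
```

Calling it at the top of each cell makes individual cells reproducible even when run out of order.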
# =========================================================================
# 1. DATA PREPARATION (SECOM specific changes)
# =========================================================================
# --- Assuming data is loaded and available: ---
# Xk_scaled: NumPy array of SECOM features (N samples x 590 features)
# y_secom_target: NumPy array of 'Pass/Fail' labels (used for coloring the plot)
Xk_scaled = scaler.fit_transform(X_k)
# X_secom_scaled_array = scaler.fit_transform(X)
y_secom_target = secom_upd['Pass/Fail'].values
# NOTE: Load your cleaned, scaled SECOM feature array (Xk_scaled)
# and the target array (y_secom_target) here.
# INPUT_DIM is read from the data itself; after the earlier feature
# selection it is 20, not the raw ~590 SECOM features.
INPUT_DIM = Xk_scaled.shape[1]
LATENT_DIM = 10
print(f"Detected Input Dimension from data: {INPUT_DIM}") # Check the output!
# Convert to tensor and create DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
X_tensor = torch.tensor(Xk_scaled, dtype=torch.float32).to(device)
dataset = torch.utils.data.TensorDataset(X_tensor)
# Seeded generator so the shuffling order is reproducible (PyTorch 1.7+)
g = torch.Generator()
g.manual_seed(SEED)
loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,
    generator=g
)
# =========================================================================
# 2. MODEL DEFINITION (Scaled for High Dimensionality)
# =========================================================================
class DeepAutoencoderSECOM(nn.Module):
    def __init__(self, input_dim=INPUT_DIM, latent_dim=LATENT_DIM):
        super(DeepAutoencoderSECOM, self).__init__()
        # Funnel widths sized for the full ~590-feature SECOM set;
        # they also work for the reduced feature set detected above
        hidden_1 = 512
        hidden_2 = 128
        hidden_3 = 64
        # Encoder: 4 layers reducing down to the latent bottleneck
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_1),   # input_dim -> 512
            nn.ReLU(),
            nn.Linear(hidden_1, hidden_2),    # 512 -> 128
            nn.ReLU(),
            nn.Linear(hidden_2, hidden_3),    # 128 -> 64
            nn.ReLU(),
            nn.Linear(hidden_3, latent_dim)   # 64 -> latent_dim (10)
        )
        # Decoder: 4 layers reconstructing the input
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_3),  # latent_dim (10) -> 64
            nn.ReLU(),
            nn.Linear(hidden_3, hidden_2),    # 64 -> 128
            nn.ReLU(),
            nn.Linear(hidden_2, hidden_1),    # 128 -> 512
            nn.ReLU(),
            nn.Linear(hidden_1, input_dim)    # 512 -> input_dim
        )

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return z, x_hat

    def encode(self, x):
        """Explicitly get the latent representation Z."""
        return self.encoder(x)
# =========================================================================
# 3. TRAINING & PLOTTING
# =========================================================================
# Initialize model, optimizer, and loss function
model = DeepAutoencoderSECOM(input_dim=INPUT_DIM, latent_dim=LATENT_DIM).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
# Training loop
epochs = 50 # Reduced epochs for faster execution/testing. Increase this for final training.
model.train()
print(f"Starting training on {INPUT_DIM} features with {epochs} epochs...")
for epoch in range(epochs):
    total_loss = 0
    for batch in loader:
        batch_x = batch[0]
        optimizer.zero_grad()
        z, x_hat = model(batch_x)
        loss = criterion(x_hat, batch_x)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    avg_loss = total_loss / len(loader)
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss={avg_loss:.6f}")
# Extract latent representation from encoder
model.eval()
with torch.no_grad():
    Z, _ = model(X_tensor)
    Z_np = Z.cpu().numpy()
# Plot the first two of the 10 latent components, colored by the Pass/Fail target
plt.figure(figsize=(8, 7))
# Use the original SECOM labels (y_secom_target) for coloring;
# in SECOM, 1 denotes 'Fail' and -1 denotes 'Pass'
plt.scatter(Z_np[:, 0], Z_np[:, 1], c=y_secom_target,
            cmap='coolwarm', s=10, alpha=0.7)
plt.title(f"SECOM Data - Deep Autoencoder Latent Space (first 2 of {LATENT_DIM} dims)")
plt.xlabel("Latent Component 1")
plt.ylabel("Latent Component 2")
plt.colorbar(label="Pass/Fail Label (-1 or 1)")
plt.grid(True)
plt.show()
Detected Input Dimension from data: 20
Starting training on 20 features with 50 epochs...
Epoch 5/50, Loss=0.569383
Epoch 10/50, Loss=0.360770
Epoch 15/50, Loss=0.228854
Epoch 20/50, Loss=0.153011
Epoch 25/50, Loss=0.121260
Epoch 30/50, Loss=0.100450
Epoch 35/50, Loss=0.091366
Epoch 40/50, Loss=0.077644
Epoch 45/50, Loss=0.073149
Epoch 50/50, Loss=0.060785
# =========================================================================
# 4. Calculate Final Reconstruction Error on the entire dataset
# =========================================================================
model.eval()
with torch.no_grad():
    # Pass the entire dataset (X_tensor) through the model
    _, X_hat_final = model(X_tensor)
    # MSE between the final reconstruction and the original data
    final_reconstruction_error = criterion(X_hat_final, X_tensor).item()
print(f"\nFinal Reconstruction MSE Error on Full Dataset: {final_reconstruction_error:.6f}")
Final Reconstruction MSE Error on Full Dataset: 0.058626
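The single MSE above averages over all samples. For the imbalance question, a per-sample version is often more informative: wafers with unusually high reconstruction error are candidate 'Fail' cases. A minimal sketch on mock arrays (in the notebook the inputs would be `Xk_scaled` and the model's reconstructions; `per_sample_mse` and the 95th-percentile cutoff are our own illustrative choices):

```python
import numpy as np

def per_sample_mse(X, X_hat):
    """Row-wise reconstruction error; high values suggest anomalous samples."""
    return np.mean((X - X_hat) ** 2, axis=1)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 20))                             # stand-in for Xk_scaled
X_hat_demo = X_demo + rng.normal(scale=0.1, size=X_demo.shape)  # mock reconstructions
errors = per_sample_mse(X_demo, X_hat_demo)
threshold = np.percentile(errors, 95)      # flag the top 5% as candidate anomalies
flagged = np.where(errors > threshold)[0]
print(len(flagged))  # → 5
```

The percentile cutoff is a tuning knob, playing the same role as `eps` in DBSCAN.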
Extract the latent variables from Autoencoder¶
with torch.no_grad():  # no graph needed for inference
    Z, _ = model(X_tensor)
# Conversion to NumPy array
Z_np = Z.cpu().numpy()
print(Z_np)
Z_np.shape
[[-1.386474   -0.90422934 -0.30455133 ...  0.4420718   1.2684146   0.44053203]
 [-2.319976   -0.40148574 -2.726464   ...  5.366594   -0.46131426  3.8805861 ]
 [ 0.37873235 -2.282571   -2.041797   ... -1.9194677   3.1717741  -3.6053414 ]
 ...
 [-0.6914102  -2.5044188  -0.21591204 ... -0.3063707   3.9818249   0.9781686 ]
 [-0.42631987 -2.4065845  -0.10155663 ...  0.7870069   3.729516    0.15664092]
 [-0.9364726  -2.5117254   0.86905694 ... -1.5053236   3.3884928   0.8203121 ]]
(1393, 10)
SMOTE Logistic Regression¶
# --- ASSUMED DATA PREPARATION ---
# X_df holds the 10 latent variables from the Autoencoder output
X_df = pd.DataFrame(Z_np, index=secom_upd.index) # Convert NumPy array Z_np to DataFrame
# y_arr is the binary target array (Pass=-1 -> 0, Fail=1 -> 1)
y_arr = (secom_upd['Pass/Fail'] == 1).astype(int).values
RANDOM_STATE = 42
THRESHOLD = 0.5
STAGE_NAME = "4th Stage: Autoencoder Latent Z (10 Features)"
print(f"======== {STAGE_NAME} ({X_df.shape[1]} features) ========")
# 1. Split Data (Stratified)
X_train, X_test, y_train, y_test = train_test_split(
X_df, y_arr, test_size=0.3, random_state=RANDOM_STATE, stratify=y_arr
)
# 2. Preprocessing Pipeline: Impute (Median) -> Scale
# Scaling is particularly important for Latent Variables
preprocessor = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
X_train_scaled = preprocessor.fit_transform(X_train)
X_test_scaled = preprocessor.transform(X_test)
# 3. SMOTE on Training Data
smote = SMOTE(sampling_strategy='minority', random_state=RANDOM_STATE)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)
print(f"Train size: {len(y_train)} -> Resampled size: {len(y_train_resampled)}")
# 4. Logistic Regression Model
model = LogisticRegression(solver='liblinear', penalty='l2', random_state=RANDOM_STATE)
model.fit(X_train_resampled, y_train_resampled)
# 5. Prediction and Evaluation (on Test Set)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
y_pred = (y_pred_proba >= THRESHOLD).astype(int)
# Metrics
f1 = f1_score(y_test, y_pred)
auprc = average_precision_score(y_test, y_pred_proba)
cm = confusion_matrix(y_test, y_pred)
print("\n--- Performance Metrics (Test Set) ---")
print(f"F1-score: {f1:.4f}")
print(f"AUPRC: {auprc:.4f}")
# 6. Plotting: Confusion Matrix & PR Curve
plt.figure(figsize=(12, 5))
# Confusion Matrix
plt.subplot(1, 2, 1)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Pass (0)', 'Fail (1)'])
disp.plot(ax=plt.gca(), cmap='Blues', values_format='d')
plt.title(f"Confusion Matrix\n({STAGE_NAME})")
plt.grid(False)
# Precision-Recall Curve
p, r, _ = precision_recall_curve(y_test, y_pred_proba)
baseline = np.sum(y_test) / len(y_test)
plt.subplot(1, 2, 2)
plt.plot(r, p, label=f'Test Set (AUPRC={auprc:.4f})')
plt.plot([0, 1], [baseline, baseline], linestyle='--', color='red', label=f'Baseline (AP={baseline:.4f})')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve (Test Set)")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
======== 4th Stage: Autoencoder Latent Z (10 Features) (10 features) ========
Train size: 975 -> Resampled size: 1812

--- Performance Metrics (Test Set) ---
F1-score: 0.2238
AUPRC: 0.1192
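The F1 of 0.2238 is computed at the fixed `THRESHOLD = 0.5`; with a minority class this small, a cheap follow-up is to scan the precision-recall curve for the F1-maximizing threshold instead. A sketch on synthetic scores shaped like the SECOM imbalance (`best_f1_threshold` is an illustrative helper, not notebook code):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_proba):
    """Scan the PR curve and return the threshold that maximizes F1."""
    p, r, thresh = precision_recall_curve(y_true, y_proba)
    f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
    best = np.argmax(f1[:-1])  # the last (p, r) point has no threshold
    return thresh[best], f1[best]

rng = np.random.default_rng(42)
y_true = (rng.random(500) < 0.1).astype(int)                    # ~10% positives, like SECOM fails
y_proba = np.clip(y_true * 0.4 + rng.random(500) * 0.6, 0, 1)   # weakly informative mock scores
t, f1 = best_f1_threshold(y_true, y_proba)
```

Applied to `y_pred_proba` from the cell above, this would replace the hard-coded 0.5 cut; the threshold should be chosen on a validation split, not the test set.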
Apply cluster method for 10 latent variables¶
Elbow¶
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(Z_np)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
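Because the elbow in a WCSS plot is often ambiguous, the silhouette score gives a useful second opinion on the number of clusters. A sketch using synthetic blobs as a stand-in for `Z_np` (in the notebook you would loop over `Z_np` directly):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two well-separated blobs in 10-D, mimicking the latent space
X_demo, _ = make_blobs(n_samples=300, centers=2, n_features=10, random_state=42)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10,
                    random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)  # higher = better-separated clusters
best_k = max(scores, key=scores.get)
print(best_k)  # → 2 for two well-separated blobs
```

Agreement between the elbow and the silhouette peak strengthens the case for k=2 used below.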
Kmeans¶
kmeans = KMeans(n_clusters = 2, init = 'k-means++', random_state = 19)
y_kmeans = kmeans.fit_predict(Z_np)
pd.Series(y_kmeans).value_counts()
| Cluster | count |
|---|---|
| 1 | 1356 |
| 0 | 37 |
plt.scatter(Z_np[y_kmeans == 0, 0], Z_np[y_kmeans == 0, 1], s = 15, c = 'lightskyblue', label = 'Cluster 1')
plt.scatter(Z_np[y_kmeans == 1, 0], Z_np[y_kmeans == 1, 1], s = 15, c = 'orange', label = 'Cluster 2')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 12, c = 'red', label = 'Centroids')
plt.title('K-Means Clusters in the Autoencoder Latent Space')
plt.xlabel('Latent Component 1')
plt.ylabel('Latent Component 2')
plt.legend()
plt.show()
Kmeans and t-SNE¶
Z_scaled = scaler.fit_transform(Z_np)
Z_scaled_df = pd.DataFrame(Z_scaled)
# --- Step 1: Apply 2D t-SNE ---
# n_components=2 for 2D visualization
tsne_2d = TSNE(n_components=2, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Z_tsne_2d = tsne_2d.fit_transform(Z_scaled_df)
# Convert to DataFrame for easier plotting
Z_tsne_df = pd.DataFrame(data = Z_tsne_2d, columns = ['tSNE_1', 'tSNE_2'])
Z_tsne_df['KMeans_Cluster'] = y_kmeans # Add cluster labels
# --- Step 2: Plot the 2D t-SNE results ---
plt.figure(figsize=(8, 6))
for cluster in sorted(Z_tsne_df['KMeans_Cluster'].unique()):
    subset = Z_tsne_df[Z_tsne_df['KMeans_Cluster'] == cluster]
    plt.scatter(
        subset['tSNE_1'],
        subset['tSNE_2'],
        s=10,
        label=f'Cluster {cluster}'
    )
plt.title('K-Means Clusters Visualized by 2D t-SNE')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.legend()
plt.show()
# Assuming Z_scaled_df and y_kmeans are ready
# --- Step 1: Apply 3D t-SNE ---
tsne_3d = TSNE(n_components=3, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Z_tsne_3d = tsne_3d.fit_transform(Z_scaled_df)
# Convert to DataFrame
Z_tsne_df_3d = pd.DataFrame(data = Z_tsne_3d, columns = ['tSNE_1', 'tSNE_2', 'tSNE_3'])
Z_tsne_df_3d['KMeans_Cluster'] = y_kmeans
# Define the color for the clusters
cluster_colors = {
0: 'lightskyblue',
1: 'orange',
}
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
for cluster_label, color_name in cluster_colors.items():
    # Fixed: subset must come from Z_tsne_df_3d (X_tsne_df_3d is undefined)
    subset = Z_tsne_df_3d[Z_tsne_df_3d['KMeans_Cluster'] == cluster_label]
    ax.scatter(
        subset['tSNE_1'],
        subset['tSNE_2'],
        subset['tSNE_3'],
        s=10,
        c=color_name,  # use the cluster-specific color
        label=f'Cluster {cluster_label}',
        alpha=0.7
    )
ax.set_title('K-Means Clusters Visualized by 3D t-SNE')
ax.set_xlabel('t-SNE Component 1')
ax.set_ylabel('t-SNE Component 2')
ax.set_zlabel('t-SNE Component 3')
ax.legend()
plt.show()
# Apply 3D t-SNE
tsne_3d = TSNE(n_components=3, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Z_tsne_3d = tsne_3d.fit_transform(Z_scaled_df)
# Convert to DataFrame
Z_tsne_df_3d = pd.DataFrame(data = Z_tsne_3d, columns = ['tSNE_1', 'tSNE_2', 'tSNE_3'])
Z_tsne_df_3d['KMeans_Cluster'] = y_kmeans
# Ensure the cluster column is treated as a string/categorical for distinct coloring
Z_tsne_df_3d['KMeans_Cluster_Str'] = Z_tsne_df_3d['KMeans_Cluster'].astype(str)
# Define the color of the Kmeans cluster
specific_plotly_colors = {
'0': 'lightskyblue',
'1': 'orange',
}
# Create the interactive 3D scatter plot using Plotly Express
fig = px.scatter_3d(
Z_tsne_df_3d,
x='tSNE_1',
y='tSNE_2',
z='tSNE_3',
color='KMeans_Cluster_Str', # Color points by cluster
color_discrete_map=specific_plotly_colors, # <<< Assign specific colors
title='Interactive 3D K-Means Clusters (Plotly)',
opacity=0.7, # Transparency level
size_max=5, # only takes effect when a 'size' column is mapped; marker size is set via update_traces below
labels={'tSNE_1': 't-SNE Component 1',
'tSNE_2': 't-SNE Component 2',
'tSNE_3': 't-SNE Component 3'}
)
# You can further customize the marker size and styling if needed:
fig.update_traces(marker=dict(size=3)) # <<< Sets a fixed size for all markers
fig.show() # This displays the rotatable plot in Google Colab
DBSCAN AND t-SNE¶
# 1. --- FIT DBSCAN MODEL ---
min_samples_start_k = int(np.log(Z_scaled_df.shape[0]))
eps_start = 2 # Start with a large value and tune down, or use a k-distance plot.
dbscan = DBSCAN(eps=eps_start, min_samples=min_samples_start_k)
secom_clusters = dbscan.fit_predict(Z_scaled_df)
print(f"DBSCAN Cluster Labels: {np.unique(secom_clusters)}")
print(f"Cluster Distribution (Tuning Required):\n{pd.Series(secom_clusters).value_counts()}")
# 2. --- PREPARE FOR VISUALIZATION (using t-SNE) ---
# DBSCAN was fit on the high-dimensional data, but we need 2D data for a plot.
# We will use t-SNE for dimensionality reduction for visualization.
tsne_2d = TSNE(n_components=2, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Z_tsne_2d = tsne_2d.fit_transform(Z_scaled_df)
# Create a DataFrame for plotting
df_secom = pd.DataFrame(Z_tsne_2d, columns=['tSNE_1', 'tSNE_2'])
df_secom['Cluster'] = secom_clusters
# 3. --- PLOT THE DBSCAN CLUSTERS (using t-SNE components) ---
plt.figure(figsize=(10, 8))
plt.scatter(
df_secom['tSNE_1'],
df_secom['tSNE_2'],
c=df_secom['Cluster'],
cmap='viridis', # 'viridis' often works well for many colors
s=10, # Minimized dot size
alpha=0.7
)
plt.title(f'DBSCAN Clusters on SECOM Data (Visualized with 2D t-SNE)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.colorbar(label='Cluster Label (-1 is Noise)')
plt.show()
DBSCAN Cluster Labels: [-1  0  1]
Cluster Distribution (Tuning Required):
 0    1304
-1      80
 1       9
Name: count, dtype: int64
# --- 1. FIT DBSCAN MODEL (No change required here) ---
# Z_scaled_df is the scaled 10-D latent feature data from the Autoencoder
min_samples_start = int(np.log(Z_scaled_df.shape[0]))
eps_start = 2
dbscan = DBSCAN(eps=eps_start, min_samples=min_samples_start)
secom_clusters = dbscan.fit_predict(Z_scaled_df)
print(f"DBSCAN Cluster Labels: {np.unique(secom_clusters)}")
print(f"Cluster Distribution (Tuning Required):\n{pd.Series(secom_clusters).value_counts()}")
# 2. --- PREPARE FOR VISUALIZATION (using 3D t-SNE) ---
# Change n_components from 2 to 3
tsne_3d = TSNE(n_components=3, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Z_tsne_3d = tsne_3d.fit_transform(Z_scaled_df)
# Create a DataFrame for plotting, adding the third component
df_secom_3d = pd.DataFrame(Z_tsne_3d, columns=['tSNE_1', 'tSNE_2', 'tSNE_3'])
df_secom_3d['Cluster'] = secom_clusters
# 3. --- PLOT THE DBSCAN CLUSTERS (using 3D t-SNE components) ---
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d') # <<< KEY CHANGE: Adds 3D axis projection
# Plotting the clusters
scatter = ax.scatter(
df_secom_3d['tSNE_1'],
df_secom_3d['tSNE_2'],
df_secom_3d['tSNE_3'], # <<< ADDING THE THIRD COMPONENT
c=df_secom_3d['Cluster'],
cmap='viridis',
s=10,
alpha=0.7
)
ax.set_title(f'DBSCAN Clusters on SECOM Data (Visualized with 3D t-SNE)')
ax.set_xlabel('t-SNE Component 1')
ax.set_ylabel('t-SNE Component 2')
ax.set_zlabel('t-SNE Component 3') # <<< ADDING Z-AXIS LABEL
# To display the color bar correctly for 3D Matplotlib
plt.colorbar(scatter, label='Cluster Label (-1 is Noise)')
plt.show()
DBSCAN Cluster Labels: [-1  0  1]
Cluster Distribution (Tuning Required):
 0    1304
-1      80
 1       9
Name: count, dtype: int64